2019-06-04 10:11:32 +02:00
// SPDX-License-Identifier: GPL-2.0-only
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 02:21:36 -08:00
/*
 * Kernel-based Virtual Machine driver for Linux
 *
 * This module enables machines with Intel VT-x extensions to run virtual
 * machines without emulation or binary translation.
 *
 * MMU support
 *
 * Copyright (C) 2006 Qumranet, Inc.
2010-10-06 14:23:22 +02:00
 * Copyright 2010 Red Hat, Inc. and/or its affiliates.
 *
 * Authors:
 *   Yaniv Kamay  <yaniv@qumranet.com>
 *   Avi Kivity   <avi@qumranet.com>
 */
KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: ", etc. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:18 +00:00
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
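The effect of the pr_fmt definition above can be approximated in userspace. This is a minimal sketch, not kernel code: KBUILD_MODNAME is normally supplied by kbuild and is hard-coded here as an assumption, and demo_pr_info() is a hypothetical stand-in for the kernel's pr_info() that formats into a buffer instead of the kernel log.

```c
#include <stdio.h>

/* Assumption: kbuild normally defines KBUILD_MODNAME; hard-coded for the demo. */
#define KBUILD_MODNAME "kvm_intel"

/* Same shape as the definition in the file: prefix every format string. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical userspace stand-in for pr_info(). Adjacent string literals
 * are concatenated at compile time, so the prefix costs nothing at runtime.
 * This simplified form requires at least one format argument.
 */
#define demo_pr_info(buf, size, fmt, ...) \
	snprintf((buf), (size), pr_fmt(fmt), __VA_ARGS__)
```

Calling demo_pr_info(buf, sizeof(buf), "unsupported feature %d\n", 7) stores "kvm_intel: unsupported feature 7\n" in buf. In the kernel, pr_fmt has to be defined before the printk headers are included, which is why the definition sits above the include block.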
2007-06-28 14:15:57 -04:00
2010-10-14 11:22:46 +02:00
# include "irq.h"
2020-05-21 05:57:49 +00:00
# include "ioapic.h"
2007-12-14 09:35:10 +08:00
# include "mmu.h"
2020-06-22 13:20:31 -07:00
# include "mmu_internal.h"
2020-10-14 11:26:43 -07:00
# include "tdp_mmu.h"
2010-01-21 15:31:49 +02:00
# include "x86.h"
2009-05-31 22:58:47 +03:00
# include "kvm_cache_regs.h"
2022-09-29 13:20:09 -04:00
# include "smm.h"
2020-02-18 15:29:49 -08:00
# include "kvm_emulate.h"
2023-07-28 18:35:27 -07:00
# include "page_track.h"
2014-05-07 15:32:50 +03:00
# include "cpuid.h"
2020-10-16 10:29:37 -04:00
# include "spte.h"
2007-06-28 14:15:57 -04:00
2007-12-16 11:02:48 +02:00
# include <linux/kvm_host.h>
# include <linux/types.h>
# include <linux/string.h>
# include <linux/mm.h>
# include <linux/highmem.h>
2016-07-13 20:19:00 -04:00
# include <linux/moduleparam.h>
# include <linux/export.h>
2007-11-26 14:08:14 +02:00
# include <linux/swap.h>
2008-02-23 11:44:30 -03:00
# include <linux/hugetlb.h>
2008-02-22 12:21:37 -05:00
# include <linux/compiler.h>
2009-12-23 14:35:21 -02:00
# include <linux/srcu.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there, i.e. if only gfp is used,
gfp.h; if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2017-02-08 18:51:30 +01:00
# include <linux/sched/signal.h>
2010-05-31 14:28:19 +08:00
# include <linux/uaccess.h>
kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadowed. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
hash max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits 49847
hash_64 105
perfect 97
While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).
Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6c5 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").
Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 13:58:25 -08:00
# include <linux/hash.h>
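The imbalance described in the commit message above can be reproduced with a small userspace sketch. It assumes level-1 page tables whose gfn is 2MB-aligned (a multiple of 512 4K pages) and compares the old "low 10 bits" index with a multiplicative hash in the style of hash_64(). All demo_* names are hypothetical, and the multiplier is assumed to match the kernel's 64-bit golden-ratio constant; treat it as illustrative rather than authoritative.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_BITS 10
#define DEMO_BUCKETS (1u << DEMO_HASH_BITS)
/* Assumption: same value as the kernel's 64-bit golden-ratio multiplier. */
#define DEMO_GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Old scheme: index by the low 10 bits of the gfn. */
static unsigned int demo_hash_low_bits(uint64_t gfn)
{
	return (unsigned int)(gfn & (DEMO_BUCKETS - 1));
}

/* Multiplicative hash: multiply, then keep the top DEMO_HASH_BITS bits. */
static unsigned int demo_hash_mult(uint64_t gfn)
{
	return (unsigned int)((gfn * DEMO_GOLDEN_RATIO_64) >> (64 - DEMO_HASH_BITS));
}

/* Count how many distinct buckets nr_pages 2MB-aligned gfns land in. */
static size_t demo_count_used_buckets(unsigned int (*hash)(uint64_t),
				      size_t nr_pages)
{
	unsigned char used[DEMO_BUCKETS] = { 0 };
	size_t i, distinct = 0;

	for (i = 0; i < nr_pages; i++) {
		uint64_t gfn = (uint64_t)i * 512; /* 2MB-aligned gfn */
		unsigned int bucket = hash(gfn);

		if (!used[bucket]) {
			used[bucket] = 1;
			distinct++;
		}
	}
	return distinct;
}
```

With the low-bits index, demo_count_used_buckets(demo_hash_low_bits, 1024) returns 2, because every 2MB-aligned gfn maps to bucket 0 or 512, which matches the buckets named in the commit message. The multiplicative variant spreads the same gfns over far more buckets.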
2016-12-06 16:46:16 -08:00
# include <linux/kern_levels.h>
2023-01-14 10:39:11 +01:00
# include <linux/kstrtox.h>
2019-11-04 20:26:00 +01:00
# include <linux/kthread.h>
2023-12-26 18:00:00 +00:00
# include <linux/wordpart.h>
2007-06-28 14:15:57 -04:00
# include <asm/page.h>
2019-11-20 15:33:57 +01:00
# include <asm/memtype.h>
2007-06-28 14:15:57 -04:00
# include <asm/cmpxchg.h>
2007-11-21 14:08:40 +02:00
# include <asm/io.h>
2021-03-09 14:42:07 -08:00
# include <asm/set_memory.h>
2024-03-04 11:12:28 +01:00
# include <asm/spec-ctrl.h>
2008-11-17 19:03:13 -02:00
# include <asm/vmx.h>
2023-07-28 18:35:27 -07:00
2017-07-13 18:30:40 -07:00
# include "trace.h"
2023-06-01 17:58:59 -07:00
static bool nx_hugepage_mitigation_hard_disabled ;
2021-05-27 15:57:51 +08:00
int __read_mostly nx_huge_pages = - 1 ;
2021-10-19 18:06:27 -07:00
static uint __read_mostly nx_huge_pages_recovery_period_ms ;
2019-11-13 15:47:06 +01:00
# ifdef CONFIG_PREEMPT_RT
/* Recovery can cause latency spikes, disable it for PREEMPT_RT. */
static uint __read_mostly nx_huge_pages_recovery_ratio = 0 ;
# else
2019-11-04 20:26:00 +01:00
static uint __read_mostly nx_huge_pages_recovery_ratio = 60 ;
2019-11-13 15:47:06 +01:00
# endif
2019-11-04 12:22:02 +01:00
2023-06-01 17:58:59 -07:00
static int get_nx_huge_pages ( char * buffer , const struct kernel_param * kp ) ;
2019-11-04 12:22:02 +01:00
static int set_nx_huge_pages ( const char * val , const struct kernel_param * kp ) ;
2021-10-19 18:06:27 -07:00
static int set_nx_huge_pages_recovery_param ( const char * val , const struct kernel_param * kp ) ;
2019-11-04 12:22:02 +01:00
2020-10-03 17:18:07 -07:00
static const struct kernel_param_ops nx_huge_pages_ops = {
2019-11-04 12:22:02 +01:00
. set = set_nx_huge_pages ,
2023-06-01 17:58:59 -07:00
. get = get_nx_huge_pages ,
2019-11-04 12:22:02 +01:00
} ;
2021-10-19 18:06:27 -07:00
static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
. set = set_nx_huge_pages_recovery_param ,
2019-11-04 20:26:00 +01:00
. get = param_get_uint ,
} ;
2019-11-04 12:22:02 +01:00
module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
__MODULE_PARM_TYPE(nx_huge_pages, "bool");
2021-10-19 18:06:27 -07:00
module_param_cb ( nx_huge_pages_recovery_ratio , & nx_huge_pages_recovery_param_ops ,
2019-11-04 20:26:00 +01:00
		&nx_huge_pages_recovery_ratio, 0644);
__MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
2021-10-19 18:06:27 -07:00
module_param_cb(nx_huge_pages_recovery_period_ms, &nx_huge_pages_recovery_param_ops,
		&nx_huge_pages_recovery_period_ms, 0644);
__MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
2019-11-04 12:22:02 +01:00
2020-03-20 14:28:28 -07:00
static bool __read_mostly force_flush_and_sync_on_reuse ;
module_param_named ( flush_on_reuse , force_flush_and_sync_on_reuse , bool , 0644 ) ;
2008-02-07 13:47:41 +01:00
/*
 * When setting this variable to true it enables Two-Dimensional-Paging
 * where the hardware walks 2 page tables:
 * 1. the guest-virtual to guest-physical
 * 2. while doing 1. it walks guest-physical to host-physical
 * If the hardware supports that we don't need to do shadow paging.
 */
2008-02-22 12:21:37 -05:00
bool tdp_enabled = false ;
2008-02-07 13:47:41 +01:00
2023-02-13 13:28:44 -08:00
static bool __ro_after_init tdp_mmu_allowed ;
2022-09-21 10:35:37 -07:00
# ifdef CONFIG_X86_64
bool __read_mostly tdp_mmu_enabled = true ;
module_param_named ( tdp_mmu , tdp_mmu_enabled , bool , 0444 ) ;
# endif
2020-07-15 20:41:21 -07:00
static int max_huge_page_level __read_mostly ;
2021-08-18 11:55:47 -05:00
static int tdp_root_level __read_mostly ;
2020-07-15 20:41:22 -07:00
static int max_tdp_level __read_mostly ;
2020-03-02 15:57:03 -08:00
2010-08-22 19:12:48 +08:00
# define PTE_PREFETCH_NUM 8
2009-12-31 12:10:16 +02:00
# include <trace/events/kvm.h>
2021-07-30 18:04:53 -04:00
/* make pte_list_desc fit well in cache lines */
2021-07-30 18:06:02 -04:00
# define PTE_LIST_EXT 14
2012-03-21 23:49:39 +09:00
2021-07-30 18:06:02 -04:00
/*
2023-01-13 20:29:10 +08:00
 * struct pte_list_desc is the core data structure used to implement a custom
 * list for tracking a set of related SPTEs, e.g. all the SPTEs that map a
 * given GFN when used in the context of rmaps.  Using a custom list allows KVM
 * to optimize for the common case where many GFNs will have at most a handful
 * of SPTEs pointing at them, i.e. allows packing multiple SPTEs into a small
 * memory footprint, which in turn improves runtime performance by exploiting
 * cache locality.
 *
 * A list is comprised of one or more pte_list_desc objects (descriptors).
 * Each individual descriptor stores up to PTE_LIST_EXT SPTEs.  If a descriptor
 * is full and a new SPTE needs to be added, a new descriptor is allocated and
 * becomes the head of the list.  This means that by definition, all tail
 * descriptors are full.
 *
 * Note, the meta data fields are deliberately placed at the start of the
 * structure to optimize the cacheline layout; accessing the descriptor will
 * touch only a single cacheline so long as @spte_count <= 6 (or if only the
 * descriptor's metadata is accessed).
2021-07-30 18:06:02 -04:00
*/
2011-05-15 23:26:20 +08:00
struct pte_list_desc {
	struct pte_list_desc *more;
2023-01-13 20:29:10 +08:00
	/* The number of PTEs stored in _this_ descriptor. */
	u32 spte_count;
	/* The number of PTEs stored in all tails of this descriptor. */
	u32 tail_count;
2021-07-30 18:06:02 -04:00
	u64 *sptes[PTE_LIST_EXT];
2007-01-05 16:36:38 -08:00
} ;
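The list behaviour described in the comment above (a full head is pushed down and a fresh descriptor becomes the new head) can be sketched in userspace as follows. This is a simplified illustration under stated assumptions, not KVM's implementation: the demo_* names are hypothetical, allocation uses calloc() instead of KVM's memory caches, and the kernel's special case of storing a single SPTE without any descriptor is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_LIST_EXT 14

/* Same field order as pte_list_desc: metadata first, then the SPTE slots. */
struct demo_desc {
	struct demo_desc *more;
	uint32_t spte_count;	/* entries stored in this descriptor */
	uint32_t tail_count;	/* entries stored in all tail descriptors */
	uint64_t *sptes[DEMO_LIST_EXT];
};

/*
 * Add one SPTE pointer and return the (possibly new) head. Returns NULL on
 * allocation failure, in which case the caller keeps its old head.
 */
static struct demo_desc *demo_list_add(struct demo_desc *head, uint64_t *spte)
{
	if (!head || head->spte_count == DEMO_LIST_EXT) {
		struct demo_desc *desc = calloc(1, sizeof(*desc));

		if (!desc)
			return NULL;
		/* The old head is full, so it becomes a (full) tail. */
		desc->more = head;
		desc->tail_count = head ? head->tail_count + head->spte_count : 0;
		head = desc;
	}
	head->sptes[head->spte_count++] = spte;
	return head;
}

/* Total number of entries, read from the head without walking the list. */
static uint32_t demo_list_count(const struct demo_desc *head)
{
	return head ? head->spte_count + head->tail_count : 0;
}

static void demo_list_free(struct demo_desc *head)
{
	while (head) {
		struct demo_desc *next = head->more;

		free(head);
		head = next;
	}
}
```

Adding 15 entries leaves a head holding 1 entry with tail_count 14, followed by one full tail. With 8-byte pointers the metadata takes 16 bytes, so the first 64-byte cacheline also holds six sptes[] slots, which is where the "@spte_count <= 6" remark in the comment comes from.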
2008-12-25 14:39:47 +02:00
struct kvm_shadow_walk_iterator {
u64 addr ;
hpa_t shadow_addr ;
u64 * sptep ;
2011-07-12 03:32:54 +08:00
int level ;
2008-12-25 14:39:47 +02:00
unsigned index ;
} ;
2018-06-27 14:59:16 -07:00
#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
	for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
					 (_root), (_addr)); \
	     shadow_walk_okay(&(_walker)); \
	     shadow_walk_next(&(_walker)))
#define for_each_shadow_entry(_vcpu, _addr, _walker) \
2008-12-25 14:39:47 +02:00
	for (shadow_walk_init(&(_walker), _vcpu, _addr); \
	     shadow_walk_okay(&(_walker)); \
	     shadow_walk_next(&(_walker)))
2011-07-12 03:32:13 +08:00
#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \
	for (shadow_walk_init(&(_walker), _vcpu, _addr); \
	     shadow_walk_okay(&(_walker)) && \
		({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
	     __shadow_walk_next(&(_walker), spte))
2011-05-15 23:26:20 +08:00
static struct kmem_cache * pte_list_desc_cache ;
2020-10-14 20:26:44 +02:00
struct kmem_cache * mmu_page_header_cache ;
KVM: create aggregate kvm_total_used_mmu_pages value
Of slab shrinkers, the VM code says:
* Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
* querying the cache size, so a fastpath for that case is appropriate.
and it *means* it. Look at how it calls the shrinkers:
nr_before = (*shrinker->shrink)(0, gfp_mask);
shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
So, if you do anything stupid in your shrinker, the VM will doubly
punish you.
The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm->mmu_lock in sequence. If we have 100 VMs, then
we're going to take 101 locks. We do it twice, so each call takes
202 locks. If we're under memory pressure, we can have each cpu
trying to do this. It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.
This is guaranteed to optimize at least half of those lock
acquisitions away. It removes the need to take any of the locks
when simply trying to count objects.
A 'percpu_counter' can be a large object, but we only have one
of these for the entire system. There are not any better
alternatives at the moment, especially ones that handle CPU
hotplug.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-19 18:11:37 -07:00
static struct percpu_counter kvm_total_used_mmu_pages ;
2007-04-15 16:31:09 +03:00
2011-07-12 03:33:44 +08:00
static void mmu_spte_set ( u64 * sptep , u64 spte ) ;
2021-06-22 10:57:05 -07:00
struct kvm_mmu_role_regs {
const unsigned long cr0 ;
const unsigned long cr4 ;
const u64 efer ;
} ;
2019-07-01 06:22:57 -04:00
# define CREATE_TRACE_POINTS
# include "mmutrace.h"
2021-06-22 10:57:05 -07:00
/*
* Yes, lots of underscores. They're a hint that you probably shouldn't be
2022-02-14 08:46:24 -05:00
* reading from the role_regs . Once the root_role is constructed , it becomes
2021-06-22 10:57:05 -07:00
* the single source of truth for the MMU's state.
*/
# define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag) \
2022-02-10 07:30:20 -05:00
static inline bool __maybe_unused \
____is_ # # reg # # _ # # name ( const struct kvm_mmu_role_regs * regs ) \
2021-06-22 10:57:05 -07:00
{ \
return ! ! ( regs - > reg & flag ) ; \
}
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr0 , pg , X86_CR0_PG ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr0 , wp , X86_CR0_WP ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , pse , X86_CR4_PSE ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , pae , X86_CR4_PAE ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , smep , X86_CR4_SMEP ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , smap , X86_CR4_SMAP ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , pke , X86_CR4_PKE ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( cr4 , la57 , X86_CR4_LA57 ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( efer , nx , EFER_NX ) ;
BUILD_MMU_ROLE_REGS_ACCESSOR ( efer , lma , EFER_LMA ) ;
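The accessor-generating macro above relies on preprocessor token pasting. As a minimal standalone sketch (the `demo_` names and the reduced register struct are hypothetical; the flag values follow the architectural bit positions of CR0.PG, CR4.PAE and EFER.NX), the same pattern can be reproduced outside the kernel like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Demo flag values; the kernel uses X86_CR0_*, X86_CR4_* and EFER_*. */
#define DEMO_CR0_PG	(1UL << 31)
#define DEMO_CR4_PAE	(1UL << 5)
#define DEMO_EFER_NX	(1ULL << 11)

struct demo_role_regs {
	unsigned long cr0;
	unsigned long cr4;
	uint64_t efer;
};

/* Pastes "reg" and "name" into a function name such as demo_is_cr0_pg(). */
#define DEMO_BUILD_ACCESSOR(reg, name, flag)				\
static inline bool demo_is_##reg##_##name(const struct demo_role_regs *regs) \
{									\
	return !!(regs->reg & flag);					\
}
DEMO_BUILD_ACCESSOR(cr0, pg, DEMO_CR0_PG)
DEMO_BUILD_ACCESSOR(cr4, pae, DEMO_CR4_PAE)
DEMO_BUILD_ACCESSOR(efer, nx, DEMO_EFER_NX)

/* Example consumer: NX only matters when paging with 64-bit entries is on. */
static inline bool demo_nx_in_effect(const struct demo_role_regs *regs)
{
	return demo_is_cr0_pg(regs) && demo_is_cr4_pae(regs) &&
	       demo_is_efer_nx(regs);
}
```

One macro invocation per bit keeps the accessors uniform and makes it hard to test a flag against the wrong register.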
2021-06-22 10:57:10 -07:00
/*
* The MMU itself (with a valid role) is the single source of truth for the
* MMU. Do not use the regs used to build the MMU/role, nor the vCPU. The
* regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
* and the vCPU may be incorrect/irrelevant.
*/
# define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \
2021-09-06 05:52:23 -04:00
static inline bool __maybe_unused is_ # # reg # # _ # # name ( struct kvm_mmu * mmu ) \
2021-06-22 10:57:10 -07:00
{ \
2022-02-11 06:50:11 -05:00
return ! ! ( mmu - > cpu_role . base_or_ext . reg # # _ # # name ) ; \
2021-06-22 10:57:10 -07:00
}
BUILD_MMU_ROLE_ACCESSOR ( base , cr0 , wp ) ;
BUILD_MMU_ROLE_ACCESSOR ( ext , cr4 , pse ) ;
BUILD_MMU_ROLE_ACCESSOR ( ext , cr4 , smep ) ;
BUILD_MMU_ROLE_ACCESSOR ( ext , cr4 , smap ) ;
BUILD_MMU_ROLE_ACCESSOR ( ext , cr4 , pke ) ;
BUILD_MMU_ROLE_ACCESSOR ( ext , cr4 , la57 ) ;
BUILD_MMU_ROLE_ACCESSOR ( base , efer , nx ) ;
2022-02-10 07:39:50 -05:00
BUILD_MMU_ROLE_ACCESSOR ( ext , efer , lma ) ;
2021-06-22 10:57:10 -07:00
2022-02-10 07:38:51 -05:00
static inline bool is_cr0_pg ( struct kvm_mmu * mmu )
{
return mmu - > cpu_role . base . level > 0 ;
}
static inline bool is_cr4_pae ( struct kvm_mmu * mmu )
{
return ! mmu - > cpu_role . base . has_4_byte_gpte ;
}
2021-06-22 10:57:05 -07:00
static struct kvm_mmu_role_regs vcpu_to_role_regs ( struct kvm_vcpu * vcpu )
{
struct kvm_mmu_role_regs regs = {
. cr0 = kvm_read_cr0_bits ( vcpu , KVM_MMU_CR0_ROLE_BITS ) ,
. cr4 = kvm_read_cr4_bits ( vcpu , KVM_MMU_CR4_ROLE_BITS ) ,
. efer = vcpu - > arch . efer ,
} ;
return regs ;
}
2018-12-06 21:21:08 +08:00
2023-03-22 02:37:26 +01:00
static unsigned long get_guest_cr3 ( struct kvm_vcpu * vcpu )
2018-12-06 21:21:08 +08:00
{
2023-03-22 02:37:26 +01:00
return kvm_read_cr3 ( vcpu ) ;
2018-12-06 21:21:08 +08:00
}
2023-03-22 02:37:26 +01:00
static inline unsigned long kvm_mmu_get_guest_pgd ( struct kvm_vcpu * vcpu ,
struct kvm_mmu * mmu )
2018-12-06 21:21:08 +08:00
{
2023-11-21 08:07:32 -08:00
if ( IS_ENABLED ( CONFIG_MITIGATION_RETPOLINE ) & & mmu - > get_guest_pgd = = get_guest_cr3 )
2023-03-22 02:37:26 +01:00
return kvm_read_cr3 ( vcpu ) ;
2018-12-06 21:21:08 +08:00
2023-03-22 02:37:26 +01:00
return mmu - > get_guest_pgd ( vcpu ) ;
2018-12-06 21:21:08 +08:00
}
2023-04-04 17:31:32 -07:00
static inline bool kvm_available_flush_remote_tlbs_range ( void )
2018-12-06 21:21:08 +08:00
{
2023-10-18 12:23:25 -07:00
# if IS_ENABLED(CONFIG_HYPERV)
2023-04-04 17:31:32 -07:00
return kvm_x86_ops . flush_remote_tlbs_range ;
2023-10-18 12:23:25 -07:00
# else
return false ;
# endif
2018-12-06 21:21:08 +08:00
}
2022-10-10 20:19:15 +08:00
static gfn_t kvm_mmu_page_get_gfn ( struct kvm_mmu_page * sp , int index ) ;
/* Flush the range of guest memory mapped by the given SPTE. */
static void kvm_flush_remote_tlbs_sptep ( struct kvm * kvm , u64 * sptep )
{
struct kvm_mmu_page * sp = sptep_to_sp ( sptep ) ;
gfn_t gfn = kvm_mmu_page_get_gfn ( sp , spte_index ( sptep ) ) ;
kvm_flush_remote_tlbs_gfn ( kvm , gfn , sp - > role . level ) ;
}
2020-02-03 15:09:10 -08:00
static void mark_mmio_spte ( struct kvm_vcpu * vcpu , u64 * sptep , u64 gfn ,
unsigned int access )
{
2021-02-25 12:47:34 -08:00
u64 spte = make_mmio_spte ( vcpu , gfn , access ) ;
2020-02-03 15:09:10 -08:00
2021-02-25 12:47:34 -08:00
trace_mark_mmio_spte ( sptep , gfn , spte ) ;
mmu_spte_set ( sptep , spte ) ;
2011-07-12 03:33:44 +08:00
}
static gfn_t get_mmio_spte_gfn ( u64 spte )
{
KVM: x86: fix L1TF's MMIO GFN calculation
One defense against L1TF in KVM is to always set the upper five bits
of the *legal* physical address in the SPTEs for non-present and
reserved SPTEs, e.g. MMIO SPTEs. In the MMIO case, the GFN of the
MMIO SPTE may overlap with the upper five bits that are being usurped
to defend against L1TF. To preserve the GFN, the bits of the GFN that
overlap with the repurposed bits are shifted left into the reserved
bits, i.e. the GFN in the SPTE will be split into high and low parts.
When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
access, get_mmio_spte_gfn() unshifts the affected bits and restores
the original GFN for comparison. Unfortunately, get_mmio_spte_gfn()
neglects to mask off the reserved bits in the SPTE that were used to
store the upper chunk of the GFN. As a result, KVM fails to detect
MMIO accesses whose GPA overlaps the repurposed bits, which in turn
causes guest panics and hangs.
Fix the bug by generating a mask that covers the lower chunk of the
GFN, i.e. the bits that aren't shifted by the L1TF mitigation. The
alternative approach would be to explicitly zero the five reserved
bits that are used to store the upper chunk of the GFN, but that
requires additional run-time computation and makes an already-ugly
bit of code even more inscrutable.
I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
warn if GENMASK_ULL() generated a nonsensical value, but that seemed
silly since that would mean a system that supports VMX has less than
18 bits of physical address space...
Reported-by: Sakari Ailus <sakari.ailus@iki.fi>
Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Cc: Junaid Shahid <junaids@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-25 13:20:00 -07:00
u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask ;
2018-08-14 10:15:34 -07:00
2020-10-30 13:39:55 -04:00
gpa | = ( spte > > SHADOW_NONPRESENT_OR_RSVD_MASK_LEN )
2018-08-14 10:15:34 -07:00
& shadow_nonpresent_or_rsvd_mask ;
return gpa > > PAGE_SHIFT ;
2011-07-12 03:33:44 +08:00
}
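The split-and-restore scheme described in the commit message above can be illustrated with a small userspace model. This is a sketch only: the `demo_` names are hypothetical, the 46-bit physical address width is an arbitrary assumption, and the real MMIO SPTE also carries generation and access bits that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_MASK_LEN	5	/* number of usurped upper address bits */
#define DEMO_PHYS_BITS	46	/* hypothetical physical address width */

/* Mask with bits lo..hi (inclusive) set. */
#define DEMO_BITS(hi, lo) ((~0ULL >> (63 - (hi))) & (~0ULL << (lo)))

/* Upper five legal address bits, always set in non-present SPTEs. */
static const uint64_t demo_rsvd_mask =
	DEMO_BITS(DEMO_PHYS_BITS - 1, DEMO_PHYS_BITS - DEMO_MASK_LEN);
/* The part of the GPA that stays where it is. */
static const uint64_t demo_lower_gfn_mask =
	DEMO_BITS(DEMO_PHYS_BITS - DEMO_MASK_LEN - 1, DEMO_PAGE_SHIFT);

static uint64_t demo_make_mmio_spte(uint64_t gfn)
{
	uint64_t gpa = gfn << DEMO_PAGE_SHIFT;
	uint64_t spte;

	/* The model only handles GPAs below the assumed address width. */
	assert((gpa & ~DEMO_BITS(DEMO_PHYS_BITS - 1, 0)) == 0);

	spte = gpa | demo_rsvd_mask;
	/* Save the overlapping GPA bits just above the usurped range. */
	spte |= (gpa & demo_rsvd_mask) << DEMO_MASK_LEN;
	return spte;
}

static uint64_t demo_get_mmio_spte_gfn(uint64_t spte)
{
	/* Take only the unshifted chunk, then merge the shifted chunk back. */
	uint64_t gpa = spte & demo_lower_gfn_mask;

	gpa |= (spte >> DEMO_MASK_LEN) & demo_rsvd_mask;
	return gpa >> DEMO_PAGE_SHIFT;
}
```

Masking with the lower-GFN mask is the step the fix adds: without it, the always-set usurped bits (and the saved upper chunk) would leak into the decoded GFN.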
static unsigned get_mmio_spte_access ( u64 spte )
{
2019-08-01 13:35:22 -07:00
return spte & shadow_mmio_access_mask ;
2011-07-12 03:33:44 +08:00
}
2015-04-08 15:39:23 +02:00
static bool check_mmio_spte ( struct kvm_vcpu * vcpu , u64 spte )
2013-06-07 16:51:26 +08:00
{
2019-02-05 13:01:16 -08:00
u64 kvm_gen , spte_gen , gen ;
2013-06-07 16:51:27 +08:00
2019-02-05 13:01:16 -08:00
gen = kvm_vcpu_memslots ( vcpu ) - > generation ;
if ( unlikely ( gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS ) )
return false ;
2013-06-07 16:51:27 +08:00
2019-02-05 13:01:16 -08:00
kvm_gen = gen & MMIO_SPTE_GEN_MASK ;
2013-06-07 16:51:27 +08:00
spte_gen = get_mmio_spte_generation ( spte ) ;
trace_check_mmio_spte ( spte , kvm_gen , spte_gen ) ;
return likely ( kvm_gen = = spte_gen ) ;
2013-06-07 16:51:26 +08:00
}
2006-12-10 02:21:36 -08:00
static int is_cpuid_PSE36 ( void )
{
return 1 ;
}
2011-07-12 03:31:28 +08:00
# ifdef CONFIG_X86_64
2009-06-10 14:24:23 +03:00
static void __set_spte ( u64 * sptep , u64 spte )
2007-05-31 15:46:04 +03:00
{
2016-05-11 08:04:29 -07:00
WRITE_ONCE ( * sptep , spte ) ;
2007-05-31 15:46:04 +03:00
}
2011-07-12 03:31:28 +08:00
static void __update_clear_spte_fast ( u64 * sptep , u64 spte )
2010-06-06 14:48:06 +03:00
{
2016-05-11 08:04:29 -07:00
WRITE_ONCE ( * sptep , spte ) ;
2011-07-12 03:31:28 +08:00
}
static u64 __update_clear_spte_slow ( u64 * sptep , u64 spte )
{
return xchg ( sptep , spte ) ;
}
2011-07-12 03:32:13 +08:00
static u64 __get_spte_lockless ( u64 * sptep )
{
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23 14:07:29 -07:00
return READ_ONCE ( * sptep ) ;
2011-07-12 03:32:13 +08:00
}
2010-06-06 14:48:06 +03:00
# else
2011-07-12 03:31:28 +08:00
union split_spte {
struct {
u32 spte_low ;
u32 spte_high ;
} ;
u64 spte ;
} ;
2010-06-06 14:48:06 +03:00
2011-07-12 03:32:13 +08:00
static void count_spte_clear ( u64 * sptep , u64 spte )
{
2020-06-22 13:20:33 -07:00
struct kvm_mmu_page * sp = sptep_to_sp ( sptep ) ;
2011-07-12 03:32:13 +08:00
if ( is_shadow_present_pte ( spte ) )
return ;
/* Ensure the spte is completely set before we increase the count */
smp_wmb ( ) ;
sp - > clear_spte_count + + ;
}
2011-07-12 03:31:28 +08:00
static void __set_spte ( u64 * sptep , u64 spte )
{
union split_spte * ssptep , sspte ;
2010-06-06 14:48:06 +03:00
2011-07-12 03:31:28 +08:00
ssptep = ( union split_spte * ) sptep ;
sspte = ( union split_spte ) spte ;
ssptep - > spte_high = sspte . spte_high ;
/*
* If we map the spte from nonpresent to present, we should store
* the high bits first, then set the present bit, so the CPU cannot
* fetch this spte while we are setting it.
*/
smp_wmb ( ) ;
2016-05-11 08:04:29 -07:00
WRITE_ONCE ( ssptep - > spte_low , sspte . spte_low ) ;
2010-06-06 14:48:06 +03:00
}
2011-07-12 03:31:28 +08:00
static void __update_clear_spte_fast ( u64 * sptep , u64 spte )
{
union split_spte * ssptep , sspte ;
ssptep = ( union split_spte * ) sptep ;
sspte = ( union split_spte ) spte ;
2016-05-11 08:04:29 -07:00
WRITE_ONCE ( ssptep - > spte_low , sspte . spte_low ) ;
2011-07-12 03:31:28 +08:00
/*
* If we map the spte from present to nonpresent, we should clear
* the present bit first so that a vCPU cannot fetch the old high bits.
*/
smp_wmb ( ) ;
ssptep - > spte_high = sspte . spte_high ;
2011-07-12 03:32:13 +08:00
count_spte_clear ( sptep , spte ) ;
2011-07-12 03:31:28 +08:00
}
static u64 __update_clear_spte_slow ( u64 * sptep , u64 spte )
{
union split_spte * ssptep , sspte , orig ;
ssptep = ( union split_spte * ) sptep ;
sspte = ( union split_spte ) spte ;
/* xchg acts as a barrier before the setting of the high bits */
orig . spte_low = xchg ( & ssptep - > spte_low , sspte . spte_low ) ;
2011-09-19 12:19:51 +08:00
orig . spte_high = ssptep - > spte_high ;
ssptep - > spte_high = sspte . spte_high ;
2011-07-12 03:32:13 +08:00
count_spte_clear ( sptep , spte ) ;
2011-07-12 03:31:28 +08:00
return orig . spte ;
}
2011-07-12 03:32:13 +08:00
/*
* The idea of using this lightweight way to get the spte on x86_32 guest is from
2019-07-11 20:56:49 -07:00
* gup_get_pte ( mm / gup . c ) .
2013-06-19 17:09:20 +08:00
*
2022-07-15 22:42:22 +00:00
* An spte tlb flush may be pending , because kvm_set_pte_rmap
2013-06-19 17:09:20 +08:00
* coalesces them and we are running out of the MMU lock . Therefore
* we need to protect against in - progress updates of the spte .
*
* Reading the spte while an update is in progress may get the old value
* for the high part of the spte . The race is fine for a present - > non - present
* change ( because the high part of the spte is ignored for non - present spte ) ,
* but for a present - > present change we must reread the spte .
*
* All such changes are done in two steps ( present - > non - present and
* non - present - > present ) , hence it is enough to count the number of
* present - > non - present updates : if it changed while reading the spte ,
* we might have hit the race . This is done using clear_spte_count .
2011-07-12 03:32:13 +08:00
*/
static u64 __get_spte_lockless ( u64 * sptep )
{
2020-06-22 13:20:33 -07:00
struct kvm_mmu_page * sp = sptep_to_sp ( sptep ) ;
2011-07-12 03:32:13 +08:00
union split_spte spte , * orig = ( union split_spte * ) sptep ;
int count ;
retry :
count = sp - > clear_spte_count ;
smp_rmb ( ) ;
spte . spte_low = orig - > spte_low ;
smp_rmb ( ) ;
spte . spte_high = orig - > spte_high ;
smp_rmb ( ) ;
if ( unlikely ( spte . spte_low ! = orig - > spte_low | |
count ! = sp - > clear_spte_count ) )
goto retry ;
return spte . spte ;
}
2011-07-12 03:31:28 +08:00
# endif
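The 32-bit read protocol above (read the low half, read the high half, then retry if the low half or the clear counter changed) can be modelled in userspace. This is a sketch under assumptions, not the kernel code: the `demo_` names are hypothetical, the counter lives next to the entry instead of in the shadow page, and sequentially consistent C11 atomics replace the kernel's smp_wmb()/smp_rmb() pairs for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_PRESENT 1u	/* assumed: bit 0 of the low half means "present" */

struct demo_spte {
	_Atomic uint32_t low;
	_Atomic uint32_t high;
	_Atomic int clear_count;	/* bumped on each present -> non-present update */
};

/* non-present -> present: publish the high half before the low half. */
static void demo_set(struct demo_spte *s, uint64_t val)
{
	assert(val & DEMO_PRESENT);
	atomic_store(&s->high, (uint32_t)(val >> 32));
	atomic_store(&s->low, (uint32_t)val);
}

/* present -> non-present: clear the low half first, then count the clear. */
static void demo_clear(struct demo_spte *s)
{
	atomic_store(&s->low, 0);
	atomic_store(&s->high, 0);
	atomic_fetch_add(&s->clear_count, 1);
}

/* Retry until both halves were read without an intervening clear. */
static uint64_t demo_get_lockless(struct demo_spte *s)
{
	uint32_t low, high;
	int count;

	do {
		count = atomic_load(&s->clear_count);
		low = atomic_load(&s->low);
		high = atomic_load(&s->high);
	} while (low != atomic_load(&s->low) ||
		 count != atomic_load(&s->clear_count));

	return ((uint64_t)high << 32) | low;
}
```

The counter is what catches a present to present change: such a change passes through a non-present state, so a reader that saw a stale high half also sees the counter move and rereads.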
2011-07-12 03:30:35 +08:00
/* Rules for using mmu_spte_set:
* Set the sptep from nonpresent to present .
* Note : the sptep being assigned * must * be either not present
* or in a state where the hardware will not attempt to update
* the spte .
*/
static void mmu_spte_set ( u64 * sptep , u64 new_spte )
{
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( is_shadow_present_pte ( * sptep ) ) ;
2011-07-12 03:30:35 +08:00
__set_spte ( sptep , new_spte ) ;
}
2016-12-06 16:46:14 -08:00
/*
* Update the SPTE ( excluding the PFN ) , but do not track changes in its
* accessed / dirty status .
2011-07-12 03:30:35 +08:00
*/
2016-12-06 16:46:14 -08:00
static u64 mmu_spte_update_no_track ( u64 * sptep , u64 new_spte )
2010-06-06 15:46:44 +03:00
{
2012-06-20 15:59:18 +08:00
u64 old_spte = * sptep ;
2010-08-02 16:15:08 +08:00
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( ! is_shadow_present_pte ( new_spte ) ) ;
2022-01-25 23:05:15 +00:00
check_spte_writable_invariants ( new_spte ) ;
2010-06-06 15:46:44 +03:00
2012-06-20 15:58:33 +08:00
if ( ! is_shadow_present_pte ( old_spte ) ) {
mmu_spte_set ( sptep , new_spte ) ;
2016-12-06 16:46:14 -08:00
return old_spte ;
2012-06-20 15:58:33 +08:00
}
2010-08-02 16:15:08 +08:00
2012-06-20 15:59:18 +08:00
if ( ! spte_has_volatile_bits ( old_spte ) )
2011-07-12 03:31:28 +08:00
__update_clear_spte_fast ( sptep , new_spte ) ;
2010-08-02 16:15:08 +08:00
else
2011-07-12 03:31:28 +08:00
old_spte = __update_clear_spte_slow ( sptep , new_spte ) ;
2010-08-02 16:15:08 +08:00
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( spte_to_pfn ( old_spte ) ! = spte_to_pfn ( new_spte ) ) ;
2016-12-06 16:46:13 -08:00
2016-12-06 16:46:14 -08:00
return old_spte ;
}
/* Rules for using mmu_spte_update:
* Update the state bits, i.e. the mapped pfn is not changed.
*
2022-01-25 23:07:23 +00:00
* Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
* TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
* spte, even though the writable spte might be cached on a CPU's TLB.
2016-12-06 16:46:14 -08:00
*
* Returns true if the TLB needs to be flushed
*/
static bool mmu_spte_update ( u64 * sptep , u64 new_spte )
{
bool flush = false ;
u64 old_spte = mmu_spte_update_no_track ( sptep , new_spte ) ;
if ( ! is_shadow_present_pte ( old_spte ) )
return false ;
2012-06-20 15:59:18 +08:00
/*
* Updating the spte outside of mmu-lock is safe, since
2016-02-23 15:34:30 -08:00
* we always atomically update it , see the comments in
2012-06-20 15:59:18 +08:00
* spte_has_volatile_bits ( ) .
*/
2022-04-23 03:47:41 +00:00
if ( is_mmu_writable_spte ( old_spte ) & &
2014-04-17 17:06:15 +08:00
! is_writable_pte ( new_spte ) )
2016-12-06 16:46:13 -08:00
flush = true ;
2010-08-02 16:15:08 +08:00
2015-01-09 16:44:30 +08:00
/*
2016-12-06 16:46:13 -08:00
* Flush TLB when accessed / dirty states are changed in the page tables ,
2015-01-09 16:44:30 +08:00
* to guarantee consistency between TLB and page tables .
*/
2016-12-06 16:46:13 -08:00
if ( is_accessed_spte ( old_spte ) & & ! is_accessed_spte ( new_spte ) ) {
flush = true ;
2010-08-02 16:15:08 +08:00
kvm_set_pfn_accessed ( spte_to_pfn ( old_spte ) ) ;
2016-12-06 16:46:13 -08:00
}
if ( is_dirty_spte ( old_spte ) & & ! is_dirty_spte ( new_spte ) ) {
flush = true ;
2010-08-02 16:15:08 +08:00
kvm_set_pfn_dirty ( spte_to_pfn ( old_spte ) ) ;
2016-12-06 16:46:13 -08:00
}
2012-06-20 15:58:33 +08:00
2016-12-06 16:46:13 -08:00
return flush ;
2010-06-06 15:46:44 +03:00
}
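The flush decision made by mmu_spte_update() can be isolated as a pure function. The sketch below is illustrative only: the bit positions and `demo_` names are hypothetical stand-ins for the kernel's SPTE helpers, and the side effects on the backing page (marking the pfn accessed or dirty) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for the demo; the real masks are configured at runtime. */
#define DEMO_PRESENT		(1ULL << 0)
#define DEMO_WRITABLE		(1ULL << 1)
#define DEMO_ACCESSED		(1ULL << 5)
#define DEMO_DIRTY		(1ULL << 6)
#define DEMO_MMU_WRITABLE	(1ULL << 58)

/* Returns true if replacing old_spte with new_spte requires a TLB flush. */
static bool demo_update_needs_flush(uint64_t old_spte, uint64_t new_spte)
{
	bool flush = false;

	/* Updates always install a present SPTE. */
	assert(new_spte & DEMO_PRESENT);

	/* Nothing could have been cached for a non-present entry. */
	if (!(old_spte & DEMO_PRESENT))
		return false;

	/* A stale writable translation may still sit in a remote TLB. */
	if ((old_spte & DEMO_MMU_WRITABLE) && !(new_spte & DEMO_WRITABLE))
		flush = true;

	/* Losing Accessed or Dirty state must be made visible to the TLBs. */
	if ((old_spte & DEMO_ACCESSED) && !(new_spte & DEMO_ACCESSED))
		flush = true;
	if ((old_spte & DEMO_DIRTY) && !(new_spte & DEMO_DIRTY))
		flush = true;

	return flush;
}
```

Note the asymmetry in the first check: the old entry is tested for MMU-writability while the new entry is tested for the hardware writable bit, matching the rule stated in the comment above the function.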
2011-07-12 03:30:35 +08:00
/*
* Rules for using mmu_spte_clear_track_bits :
* Sets the sptep from present to nonpresent and tracks the
* state bits; it is used to clear the last-level sptep.
2021-07-02 15:04:51 -07:00
* Returns the old PTE .
2011-07-12 03:30:35 +08:00
*/
2022-07-15 22:42:20 +00:00
static u64 mmu_spte_clear_track_bits ( struct kvm * kvm , u64 * sptep )
2011-07-12 03:30:35 +08:00
{
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 16:56:11 -08:00
kvm_pfn_t pfn ;
2011-07-12 03:30:35 +08:00
u64 old_spte = * sptep ;
2021-08-02 21:46:07 -07:00
int level = sptep_to_sp ( sptep ) - > role . level ;
2022-04-29 01:04:15 +00:00
struct page * page ;
2011-07-12 03:30:35 +08:00
KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers. Well, caller, since only one of its two callers
doesn't already do the shadow-present check.
Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change. Unlike the legacy MMU, the TDP MMU uses a single path for clear
leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(), and calling is_last_spte() without first
calling is_shadow_present_spte() is at best odd, and at worst a violation
of KVM's loosely defined SPTE rules.
Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present. Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 03:47:42 +00:00
	if (!is_shadow_present_pte(old_spte) ||
	    !spte_has_volatile_bits(old_spte))
2011-07-12 03:31:28 +08:00
		__update_clear_spte_fast(sptep, 0ull);
2011-07-12 03:30:35 +08:00
	else
2011-07-12 03:31:28 +08:00
		old_spte = __update_clear_spte_slow(sptep, 0ull);
2011-07-12 03:30:35 +08:00
2015-11-20 17:44:55 +09:00
	if (!is_shadow_present_pte(old_spte))
2021-07-02 15:04:51 -07:00
		return old_spte;
2011-07-12 03:30:35 +08:00
2021-08-02 21:46:07 -07:00
	kvm_update_page_stats(kvm, level, -1);
2011-07-12 03:30:35 +08:00
	pfn = spte_to_pfn(old_spte);
2012-07-17 21:52:52 +08:00
	/*
2022-04-29 01:04:15 +00:00
	 * KVM doesn't hold a reference to any pages mapped into the guest, and
	 * instead uses the mmu_notifier to ensure that KVM unmaps any pages
	 * before they are reclaimed. Sanity check that, if the pfn is backed
	 * by a refcounted page, the refcount is elevated.
2012-07-17 21:52:52 +08:00
	 */
2022-04-29 01:04:15 +00:00
	page = kvm_pfn_to_refcounted_page(pfn);
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(page && !page_count(page));
2012-07-17 21:52:52 +08:00
2016-12-06 16:46:13 -08:00
	if (is_accessed_spte(old_spte))
2011-07-12 03:30:35 +08:00
		kvm_set_pfn_accessed(pfn);
2016-12-06 16:46:13 -08:00
	if (is_dirty_spte(old_spte))
2011-07-12 03:30:35 +08:00
		kvm_set_pfn_dirty(pfn);
2016-12-06 16:46:13 -08:00
2021-07-02 15:04:51 -07:00
	return old_spte;
2011-07-12 03:30:35 +08:00
}
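The warn-once behaviour motivated by the WARN_ON_ONCE() conversion above can be approximated in userspace. This is only a sketch under stated assumptions: SKETCH_WARN_ON_ONCE, splat_count and simulate_faults are hypothetical names, the macro relies on GNU C statement expressions, and it is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so the example can observe how many splats were logged. */
static int splat_count;

/*
 * Userspace approximation of a warn-once assertion (assumption: GNU C
 * statement expressions are available). It evaluates to the truth value
 * of cond on every call, but reports only the first failure seen at
 * this call site.
 */
#define SKETCH_WARN_ON_ONCE(cond) ({                      \
	static bool warned_;                              \
	bool failed_ = !!(cond);                          \
	if (failed_ && !warned_) {                        \
		warned_ = true;                           \
		splat_count++;                            \
		fprintf(stderr, "WARN: %s\n", #cond);     \
	}                                                 \
	failed_;                                          \
})

/* Simulates n events that all trip the same assertion; returns the hit count. */
static int simulate_faults(int n)
{
	int hits = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (SKETCH_WARN_ON_ONCE(i >= 0)) /* always true: stands in for the bug */
			hits++;
	return hits;
}
```

Every failing check is still visible to the caller through the macro's value, so error handling is unaffected; only the logging is limited to one splat per call site.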
/*
 * Rules for using mmu_spte_clear_no_track:
 * Directly clear the spte without caring about the state bits of sptep;
 * it is used to set the upper level spte.
 */
static void mmu_spte_clear_no_track(u64 *sptep)
{
2011-07-12 03:31:28 +08:00
	__update_clear_spte_fast(sptep, 0ull);
2011-07-12 03:30:35 +08:00
}
2011-07-12 03:32:13 +08:00
static u64 mmu_spte_get_lockless(u64 *sptep)
{
	return __get_spte_lockless(sptep);
}
2016-12-06 16:46:16 -08:00
/* Returns the Accessed status of the PTE and resets it at the same time. */
static bool mmu_spte_age(u64 *sptep)
{
	u64 spte = mmu_spte_get_lockless(sptep);
	if (!is_accessed_spte(spte))
		return false;
2017-06-30 17:26:31 -07:00
	if (spte_ad_enabled(spte)) {
2016-12-06 16:46:16 -08:00
		clear_bit((ffs(shadow_accessed_mask) - 1),
			  (unsigned long *)sptep);
	} else {
		/*
		 * Capture the dirty status of the page, so that it doesn't get
		 * lost when the SPTE is marked for access tracking.
		 */
		if (is_writable_pte(spte))
			kvm_set_pfn_dirty(spte_to_pfn(spte));
		spte = mark_spte_for_access_track(spte);
		mmu_spte_update_no_track(sptep, spte);
	}
	return true;
}
2022-10-12 18:16:59 +00:00
static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
{
	return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
}
2011-07-12 03:32:13 +08:00
static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
2022-10-12 18:16:59 +00:00
	if (is_tdp_mmu_active(vcpu)) {
2021-07-13 22:09:54 +00:00
		kvm_tdp_mmu_walk_lockless_begin();
	} else {
		/*
		 * Prevent page table teardown by making any free-er wait during
		 * kvm_flush_remote_tlbs() IPI to all active vcpus.
		 */
		local_irq_disable();
2016-03-13 11:10:25 +08:00
2021-07-13 22:09:54 +00:00
		/*
		 * Make sure a following spte read is not reordered ahead of the write
		 * to vcpu->mode.
		 */
		smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
	}
2011-07-12 03:32:13 +08:00
}
static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
2022-10-12 18:16:59 +00:00
	if (is_tdp_mmu_active(vcpu)) {
2021-07-13 22:09:54 +00:00
		kvm_tdp_mmu_walk_lockless_end();
	} else {
		/*
		 * Make sure the write to vcpu->mode is not reordered in front of
		 * reads to sptes. If it does, kvm_mmu_commit_zap_page() can see us
		 * OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
		 */
		smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
		local_irq_enable();
	}
2011-07-12 03:32:13 +08:00
}
2020-07-02 19:35:36 -07:00
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
2007-01-05 16:36:53 -08:00
{
2007-01-05 16:36:54 -08:00
	int r;
KVM: x86/mmu: Clean up the gorilla math in mmu_topup_memory_caches()
Clean up the minimums in mmu_topup_memory_caches() to document the
driving mechanisms behind the minimums. Now that encountering an empty
cache is unlikely to trigger BUG_ON(), it is less dangerous to be more
precise when defining the minimums.
For rmaps, the logic is 1 parent PTE per level, plus a single rmap, and
prefetched rmaps. The extra objects in the current '8 + PREFETCH'
minimum came about due to an abundance of paranoia in commit
c41ef344de212 ("KVM: MMU: increase per-vcpu rmap cache alloc size"),
i.e. it could have increased the minimum to 2 rmaps. Furthermore, the
unexpected extra rmap case was killed off entirely by commits
f759e2b4c728c ("KVM: MMU: avoid pte_list_desc running out in
kvm_mmu_pte_write") and f5a1e9f89504f ("KVM: MMU: remove call to
kvm_mmu_pte_write from walk_addr").
For the so called page cache, replace '8' with 2*PT64_ROOT_MAX_LEVEL.
The 2x multiplier is needed because the cache is used for both shadow
pages and gfn arrays for indirect MMUs.
And finally, for page headers, replace '4' with PT64_ROOT_MAX_LEVEL.
Note, KVM now supports 5-level paging, i.e. the old minimums that used a
baseline derived from 4-level paging were technically wrong. But, KVM
always allocates roots in a separate flow, e.g. it's impossible in the
current implementation to actually need 5 new shadow pages in a single
flow. Use PT64_ROOT_MAX_LEVEL unmodified instead of subtracting 1, as
the direct usage is likely more intuitive to uninformed readers, and the
inflated minimum is unlikely to affect functionality in practice.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-02 19:35:32 -07:00
/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
2020-07-02 19:35:37 -07:00
	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
2007-05-30 12:34:53 +03:00
	if (r)
2020-07-02 19:35:28 -07:00
		return r;
2020-07-02 19:35:37 -07:00
	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
				       PT64_ROOT_MAX_LEVEL);
2007-05-30 12:34:53 +03:00
	if (r)
2020-07-02 19:35:33 -07:00
		return r;
2020-07-02 19:35:36 -07:00
	if (maybe_indirect) {
KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.
In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.
While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:04 -04:00
		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
2020-07-02 19:35:37 -07:00
					       PT64_ROOT_MAX_LEVEL);
2020-07-02 19:35:36 -07:00
		if (r)
			return r;
	}
2020-07-02 19:35:37 -07:00
	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
					  PT64_ROOT_MAX_LEVEL);
2007-01-05 16:36:53 -08:00
}
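The cache minimums used by mmu_topup_memory_caches() reduce to simple arithmetic. The sketch below assumes PT64_ROOT_MAX_LEVEL is 5 and PTE_PREFETCH_NUM is 8; those values are assumptions for illustration and may differ between kernel versions, and the SKETCH_ names and helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration only; check the kernel headers in use. */
#define SKETCH_PT64_ROOT_MAX_LEVEL 5
#define SKETCH_PTE_PREFETCH_NUM    8

/* 1 rmap + 1 parent PTE per level + the prefetched rmaps. */
static int pte_list_desc_min(void)
{
	return 1 + SKETCH_PT64_ROOT_MAX_LEVEL + SKETCH_PTE_PREFETCH_NUM;
}

/*
 * One object per paging level: the minimum passed for the shadow page,
 * shadowed info and page header caches in the function above.
 */
static int per_level_min(void)
{
	assert(SKETCH_PT64_ROOT_MAX_LEVEL > 0);
	return SKETCH_PT64_ROOT_MAX_LEVEL;
}
```

Under these assumed values the pte_list_desc cache is topped up to at least 14 objects and each of the other caches to at least 5.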
static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
2020-07-02 19:35:37 -07:00
	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
2022-06-22 15:27:04 -04:00
	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
2020-07-02 19:35:37 -07:00
	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
2007-01-05 16:36:53 -08:00
}
2011-05-15 23:26:20 +08:00
static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
2007-01-05 16:36:53 -08:00
{
2011-05-15 23:26:20 +08:00
	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
2007-01-05 16:36:53 -08:00
}
2022-06-22 15:27:04 -04:00
static bool sp_has_gptes(struct kvm_mmu_page *sp);
2010-05-26 16:49:59 +08:00
static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
{
2022-04-20 21:12:04 +08:00
	if (sp->role.passthrough)
		return sp->gfn;
2010-05-26 16:49:59 +08:00
	if (!sp->role.direct)
2022-06-22 15:27:04 -04:00
		return sp->shadowed_translation[index] >> PAGE_SHIFT;
2010-05-26 16:49:59 +08:00
2022-06-14 23:33:25 +00:00
	return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
2010-05-26 16:49:59 +08:00
}
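For direct shadow pages, the gfn mapped by a given index follows from the page's base gfn and level alone. The following is a userspace sketch of that calculation, assuming 9 index bits per level (512 entries per table); sketch_gfn_t, SKETCH_SPTE_LEVEL_BITS and direct_sp_index_to_gfn are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SPTE_LEVEL_BITS 9 /* assumed: 512 entries per page table */

typedef uint64_t sketch_gfn_t;

/*
 * Each entry of a level-N table covers 512^(N-1) 4 KiB frames, so the
 * index is scaled by that amount and added to the table's base gfn.
 */
static sketch_gfn_t direct_sp_index_to_gfn(sketch_gfn_t base_gfn, int level,
					    unsigned int index)
{
	assert(level >= 1);
	assert(index < (1u << SKETCH_SPTE_LEVEL_BITS));
	return base_gfn +
	       ((sketch_gfn_t)index << ((level - 1) * SKETCH_SPTE_LEVEL_BITS));
}
```

The sketch widens the index to 64 bits before shifting so that large level values cannot overflow a 32-bit intermediate.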
2022-06-22 15:27:04 -04:00
/*
 * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
 * that the SPTE itself may have more constrained access permissions than
 * what the guest enforces. For example, a guest may create an executable
 * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
 */
static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
2010-05-26 16:49:59 +08:00
{
2022-06-22 15:27:04 -04:00
	if (sp_has_gptes(sp))
		return sp->shadowed_translation[index] & ACC_ALL;
2022-04-20 21:12:04 +08:00
2022-06-22 15:27:04 -04:00
	/*
	 * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
	 * KVM is not shadowing any guest page tables, so the "guest access
	 * permissions" are just ACC_ALL.
	 *
	 * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
	 * is shadowing a guest huge page with small pages, the guest access
	 * permissions being shadowed are the access permissions of the huge
	 * page.
	 *
	 * In both cases, sp->role.access contains the correct access bits.
	 */
	return sp->role.access;
}
2022-06-24 17:18:07 +00:00
static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
					 gfn_t gfn, unsigned int access)
2022-06-22 15:27:04 -04:00
{
	if (sp_has_gptes(sp)) {
		sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
2019-06-30 08:36:21 -04:00
		return;
	}
2022-06-22 15:27:04 -04:00
	WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
		  "access mismatch under %s page %llx (expected %u, got %u)\n",
		  sp->role.passthrough ? "passthrough" : "direct",
		  sp->gfn, kvm_mmu_page_get_access(sp, index), access);
	WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
		  "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
		  sp->role.passthrough ? "passthrough" : "direct",
		  sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
}
2022-06-24 17:18:07 +00:00
static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
				    unsigned int access)
2022-06-22 15:27:04 -04:00
{
	gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
	kvm_mmu_page_set_translation(sp, index, gfn, access);
2010-05-26 16:49:59 +08:00
}
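The shadowed_translation entries written and read above pack a gfn and its guest access bits into one 64-bit word. A minimal userspace sketch of that encoding follows, assuming a PAGE_SHIFT of 12 and a three-bit ACC_ALL mask of 0x7; the SKETCH_ constants and the three helper functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed: 4 KiB pages */
#define SKETCH_ACC_ALL    0x7u /* assumed: three access permission bits */

/* Store the gfn in the upper bits and the access bits in the low bits. */
static uint64_t pack_translation(uint64_t gfn, unsigned int access)
{
	assert((access & ~SKETCH_ACC_ALL) == 0);
	return (gfn << SKETCH_PAGE_SHIFT) | access;
}

/* Recover the gfn by shifting the access bits (and padding) away. */
static uint64_t translation_gfn(uint64_t translation)
{
	return translation >> SKETCH_PAGE_SHIFT;
}

/* Recover the access bits by masking off everything else. */
static unsigned int translation_access(uint64_t translation)
{
	return (unsigned int)(translation & SKETCH_ACC_ALL);
}
```

Because the shift is a full page shift rather than three bits, the low bits above the access mask are simply unused in this layout.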
2008-02-23 11:44:30 -03:00
/*
2010-12-07 12:59:07 +09:00
 * Return the pointer to the large page information for a given gfn,
 * handling slots that are not large page aligned.
2008-02-23 11:44:30 -03:00
 */
2010-12-07 12:59:07 +09:00
static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
2021-04-01 16:37:24 -07:00
		const struct kvm_memory_slot *slot, int level)
2008-02-23 11:44:30 -03:00
{
	unsigned long idx;
2012-02-08 12:59:10 +09:00
	idx = gfn_to_index(gfn, slot->base_gfn, level);
2012-02-08 13:02:18 +09:00
	return &slot->arch.lpage_info[level - 2][idx];
2008-02-23 11:44:30 -03:00
}
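How a slot that is not large-page aligned can still be indexed per huge-page region is easiest to see with numbers. The sketch below is an assumption about how such an index could be computed (shift both the gfn and the slot's base gfn down to the huge-page granularity, then subtract); sketch_gfn_to_index is a hypothetical stand-in and is not claimed to match the kernel's gfn_to_index() exactly.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LEVEL_BITS 9 /* assumed: each level spans 512 lower-level entries */

/*
 * Index of the level-sized region containing gfn, counted from the
 * region containing the slot's first gfn. Shifting both values before
 * subtracting keeps the result correct when base_gfn is unaligned.
 */
static uint64_t sketch_gfn_to_index(uint64_t gfn, uint64_t base_gfn, int level)
{
	int shift;

	assert(level >= 1);
	assert(gfn >= base_gfn);
	shift = (level - 1) * SKETCH_LEVEL_BITS;
	return (gfn >> shift) - (base_gfn >> shift);
}
```

With a slot starting at gfn 0x100, which is not 2 MiB aligned, gfn 0x1ff still falls in region 0 and gfn 0x300 in region 1 at level 2.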
2023-10-27 11:22:01 -07:00
/*
 * The most significant bit in disallow_lpage tracks whether or not memory
 * attributes are mixed, i.e. not identical for all gfns at the current level.
 * The lower order bits are used to refcount other cases where a hugepage is
 * disallowed, e.g. if KVM has shadowed a page table at the gfn.
 */
#define KVM_LPAGE_MIXED_FLAG	BIT(31)
2021-07-12 22:33:38 -04:00
static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
2016-02-24 17:51:07 +08:00
					    gfn_t gfn, int count)
{
	struct kvm_lpage_info *linfo;
2023-10-27 11:22:01 -07:00
	int old, i;
2016-02-24 17:51:07 +08:00
2020-04-27 17:54:22 -07:00
	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
2016-02-24 17:51:07 +08:00
		linfo = lpage_info_slot(gfn, slot, i);
2023-10-27 11:22:01 -07:00
		old = linfo->disallow_lpage;
2016-02-24 17:51:07 +08:00
		linfo->disallow_lpage += count;
2023-10-27 11:22:01 -07:00
		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
2016-02-24 17:51:07 +08:00
	}
}
2021-07-12 22:33:38 -04:00
void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
2016-02-24 17:51:07 +08:00
{
	update_gfn_disallow_lpage_count(slot, gfn, 1);
}
2021-07-12 22:33:38 -04:00
void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
2016-02-24 17:51:07 +08:00
{
	update_gfn_disallow_lpage_count(slot, gfn, -1);
}
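The check in update_gfn_disallow_lpage_count() guards against a refcount adjustment spilling into the flag bit. Here is a userspace sketch of that idea, assuming the flag is bit 31 of a 32-bit word; it uses an unsigned type to keep wraparound well defined, and SKETCH_LPAGE_MIXED_FLAG, disallow_after and flag_preserved are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LPAGE_MIXED_FLAG (1u << 31) /* assumed: most significant bit */

/* Value of the combined flag-plus-refcount word after adding count. */
static uint32_t disallow_after(uint32_t disallow, int count)
{
	return disallow + (uint32_t)count;
}

/*
 * True if adjusting the refcount left the flag bit untouched. XOR-ing
 * old and new values exposes every bit that changed; the adjustment is
 * only safe if the flag bit is not among them.
 */
static bool flag_preserved(uint32_t disallow, int count)
{
	uint32_t updated = disallow_after(disallow, count);

	return !((disallow ^ updated) & SKETCH_LPAGE_MIXED_FLAG);
}
```

An unbalanced decrement from a zero count, or from a word holding only the flag, flips bit 31, which is the condition the sketch reports as not preserved.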
2015-05-19 16:29:22 +02:00
static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
2008-02-23 11:44:30 -03:00
{
2015-05-18 15:03:39 +02:00
	struct kvm_memslots *slots;
2009-07-27 16:30:43 +02:00
	struct kvm_memory_slot *slot;
2015-05-19 16:29:22 +02:00
	gfn_t gfn;
2008-02-23 11:44:30 -03:00
2016-02-24 17:51:14 +08:00
	kvm->arch.indirect_shadow_pages++;
2015-05-19 16:29:22 +02:00
	gfn = sp->gfn;
2015-05-18 15:03:39 +02:00
	slots = kvm_memslots_for_spte_role(kvm, sp->role);
	slot = __gfn_to_memslot(slots, gfn);
2016-02-24 17:51:14 +08:00
	/* Non-leaf shadow pages are kept read-only. */
2020-04-27 17:54:22 -07:00
	if (sp->role.level > PG_LEVEL_4K)
2023-07-28 18:35:33 -07:00
		return __kvm_write_track_add_gfn(kvm, slot, gfn);
2016-02-24 17:51:14 +08:00
2016-02-24 17:51:07 +08:00
	kvm_mmu_gfn_disallow_lpage(slot, gfn);
2022-06-22 15:26:56 -04:00
	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
2022-10-10 20:19:17 +08:00
		kvm_flush_remote_tlbs_gfn(kvm, gfn, PG_LEVEL_4K);
2008-02-23 11:44:30 -03:00
}
2022-10-19 16:56:14 +00:00
void track_possible_nx_huge_page ( struct kvm * kvm , struct kvm_mmu_page * sp )
2019-11-04 12:22:02 +01:00
{
2022-10-19 16:56:11 +00:00
	/*
	 * If it's possible to replace the shadow page with an NX huge page,
	 * i.e. if the shadow page is the only thing currently preventing KVM
	 * from using a huge page, add the shadow page to the list of "to be
	 * zapped for NX recovery" pages.  Note, the shadow page can already be
	 * on the list if KVM is reusing an existing shadow page, i.e. if KVM
	 * links a shadow page at multiple points.
	 */
2022-10-19 16:56:14 +00:00
	if (!list_empty(&sp->possible_nx_huge_page_link))
2019-11-04 12:22:02 +01:00
		return;
	++kvm->stat.nx_lpage_splits;
2022-10-19 16:56:12 +00:00
	list_add_tail(&sp->possible_nx_huge_page_link,
		      &kvm->arch.possible_nx_huge_pages);
2019-11-04 12:22:02 +01:00
}
2022-10-19 16:56:14 +00:00
static void account_nx_huge_page ( struct kvm * kvm , struct kvm_mmu_page * sp ,
bool nx_huge_page_possible )
{
sp - > nx_huge_page_disallowed = true ;
if ( nx_huge_page_possible )
track_possible_nx_huge_page ( kvm , sp ) ;
2019-11-04 12:22:02 +01:00
}
2015-05-19 16:29:22 +02:00
static void unaccount_shadowed ( struct kvm * kvm , struct kvm_mmu_page * sp )
2008-02-23 11:44:30 -03:00
{
2015-05-18 15:03:39 +02:00
struct kvm_memslots * slots ;
2009-07-27 16:30:43 +02:00
struct kvm_memory_slot * slot ;
2015-05-19 16:29:22 +02:00
gfn_t gfn ;
2008-02-23 11:44:30 -03:00
2016-02-24 17:51:14 +08:00
	kvm->arch.indirect_shadow_pages--;
2015-05-19 16:29:22 +02:00
gfn = sp - > gfn ;
2015-05-18 15:03:39 +02:00
slots = kvm_memslots_for_spte_role ( kvm , sp - > role ) ;
slot = __gfn_to_memslot ( slots , gfn ) ;
2020-04-27 17:54:22 -07:00
	if (sp->role.level > PG_LEVEL_4K)
2023-07-28 18:35:33 -07:00
		return __kvm_write_track_remove_gfn(kvm, slot, gfn);
2016-02-24 17:51:14 +08:00
2016-02-24 17:51:07 +08:00
kvm_mmu_gfn_allow_lpage ( slot , gfn ) ;
2008-02-23 11:44:30 -03:00
}
2022-10-19 16:56:14 +00:00
void untrack_possible_nx_huge_page ( struct kvm * kvm , struct kvm_mmu_page * sp )
2019-11-04 12:22:02 +01:00
{
2022-10-19 16:56:12 +00:00
	if (list_empty(&sp->possible_nx_huge_page_link))
2022-10-19 16:56:11 +00:00
		return;
2019-11-04 12:22:02 +01:00
	--kvm->stat.nx_lpage_splits;
2022-10-19 16:56:12 +00:00
	list_del_init(&sp->possible_nx_huge_page_link);
2019-11-04 12:22:02 +01:00
}
2022-10-19 16:56:14 +00:00
static void unaccount_nx_huge_page ( struct kvm * kvm , struct kvm_mmu_page * sp )
{
sp - > nx_huge_page_disallowed = false ;
untrack_possible_nx_huge_page ( kvm , sp ) ;
2019-11-04 12:22:02 +01:00
}
2023-02-02 18:27:51 +00:00
static struct kvm_memory_slot * gfn_to_memslot_dirty_bitmap ( struct kvm_vcpu * vcpu ,
gfn_t gfn ,
bool no_dirty_log )
2008-02-23 11:44:30 -03:00
{
struct kvm_memory_slot * slot ;
2011-03-09 15:43:00 +08:00
2015-04-08 15:39:23 +02:00
	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
2020-01-21 16:16:32 +01:00
	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
		return NULL;
2020-09-30 21:22:26 -04:00
	if (no_dirty_log && kvm_slot_dirty_track_enabled(slot))
2020-01-21 16:16:32 +01:00
		return NULL;
2011-03-09 15:43:00 +08:00
return slot ;
}
2007-09-27 14:11:22 +02:00
/*
2015-11-20 17:41:28 +09:00
 * About rmap_head encoding:
2007-01-05 16:36:38 -08:00
 *
2015-11-20 17:41:28 +09:00
 * If bit zero of rmap_head->val is clear, then it points to the only spte
 * in this rmap chain.  Otherwise, (rmap_head->val & ~1) points to a struct
2011-05-15 23:26:20 +08:00
 * pte_list_desc containing more mappings.
2015-11-20 17:41:28 +09:00
*/
/*
 * Returns the number of pointers in the rmap chain, not counting the new one.
2007-01-05 16:36:38 -08:00
*/
2022-06-22 15:27:02 -04:00
static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
2015-11-20 17:41:28 +09:00
			struct kvm_rmap_head *rmap_head)
2007-01-05 16:36:38 -08:00
{
2011-05-15 23:26:20 +08:00
struct pte_list_desc * desc ;
2021-07-30 18:06:02 -04:00
int count = 0 ;
2007-01-05 16:36:38 -08:00
2015-11-20 17:41:28 +09:00
	if (!rmap_head->val) {
		rmap_head->val = (unsigned long)spte;
	} else if (!(rmap_head->val & 1)) {
2022-06-22 15:27:02 -04:00
		desc = kvm_mmu_memory_cache_alloc(cache);
2015-11-20 17:41:28 +09:00
		desc->sptes[0] = (u64 *)rmap_head->val;
2009-06-10 14:24:23 +03:00
		desc->sptes[1] = spte;
2021-07-30 18:06:02 -04:00
		desc->spte_count = 2;
2023-01-13 20:29:10 +08:00
		desc->tail_count = 0;
2015-11-20 17:41:28 +09:00
		rmap_head->val = (unsigned long)desc | 1;
2010-09-18 08:41:02 +08:00
		++count;
2007-01-05 16:36:38 -08:00
	} else {
2015-11-20 17:41:28 +09:00
		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
2023-01-13 20:29:10 +08:00
		count = desc->tail_count + desc->spte_count;
		/*
		 * If the previous head is full, allocate a new head descriptor
		 * as tail descriptors are always kept full.
		 */
		if (desc->spte_count == PTE_LIST_EXT) {
			desc = kvm_mmu_memory_cache_alloc(cache);
			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
			desc->spte_count = 0;
			desc->tail_count = count;
			rmap_head->val = (unsigned long)desc | 1;
2007-01-05 16:36:38 -08:00
		}
2021-07-30 18:06:02 -04:00
		desc->sptes[desc->spte_count++] = spte;
2007-01-05 16:36:38 -08:00
}
2009-08-05 15:43:58 -03:00
return count ;
2007-01-05 16:36:38 -08:00
}
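The rmap_head encoding and the "only the head descriptor may be partially filled" rule can be exercised outside the kernel. The following is a minimal userspace sketch under stated assumptions: `sk_desc`, `sk_head`, `sk_rmap_add()` and `sk_rmap_count()` are hypothetical names, `SK_LIST_EXT` is an arbitrary small capacity standing in for PTE_LIST_EXT, and calloc() replaces the MMU memory cache. It mirrors the three cases of pte_list_add(): empty head, single tagged-free pointer, and descriptor chain with bit zero set.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_LIST_EXT 4	/* hypothetical capacity, stands in for PTE_LIST_EXT */

struct sk_desc {
	struct sk_desc *more;		/* next (always full) tail descriptor */
	unsigned int spte_count;	/* entries used in this descriptor */
	unsigned int tail_count;	/* entries in all descriptors after it */
	uint64_t *sptes[SK_LIST_EXT];
};

struct sk_head {
	uintptr_t val;	/* 0, a bare pointer, or (descriptor pointer | 1) */
};

/* Number of entries reachable from head, in O(1) thanks to tail_count. */
static unsigned int sk_rmap_count(const struct sk_head *head)
{
	const struct sk_desc *desc;

	if (!head->val)
		return 0;
	if (!(head->val & 1))
		return 1;
	desc = (const struct sk_desc *)(head->val & ~(uintptr_t)1);
	return desc->tail_count + desc->spte_count;
}

/* Add spte; returns the number of entries present before the add. */
static int sk_rmap_add(struct sk_head *head, uint64_t *spte)
{
	struct sk_desc *desc;
	int count = 0;

	if (!head->val) {
		/* Empty: store the pointer itself, bit zero stays clear. */
		head->val = (uintptr_t)spte;
	} else if (!(head->val & 1)) {
		/* One entry: move it into a new descriptor and tag the head. */
		desc = calloc(1, sizeof(*desc));
		assert(desc != NULL);
		desc->sptes[0] = (uint64_t *)head->val;
		desc->sptes[1] = spte;
		desc->spte_count = 2;
		head->val = (uintptr_t)desc | 1;
		count = 1;
	} else {
		desc = (struct sk_desc *)(head->val & ~(uintptr_t)1);
		count = (int)(desc->tail_count + desc->spte_count);
		if (desc->spte_count == SK_LIST_EXT) {
			/* Head is full: it becomes a tail, push a new head. */
			struct sk_desc *new_head = calloc(1, sizeof(*new_head));

			assert(new_head != NULL);
			new_head->more = desc;
			new_head->tail_count = (unsigned int)count;
			head->val = (uintptr_t)new_head | 1;
			desc = new_head;
		}
		desc->sptes[desc->spte_count++] = spte;
	}
	return count;
}
```

The tagging relies on both kinds of pointer being at least 2-byte aligned, so bit zero is free to distinguish "single spte" from "descriptor chain".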
2023-07-28 17:47:21 -07:00
static void pte_list_desc_remove_entry ( struct kvm * kvm ,
struct kvm_rmap_head * rmap_head ,
2023-02-02 18:27:51 +00:00
struct pte_list_desc * desc , int i )
2007-01-05 16:36:38 -08:00
{
2023-01-13 20:29:10 +08:00
	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
	int j = head_desc->spte_count - 1;
2007-01-05 16:36:38 -08:00
2023-01-13 20:29:10 +08:00
	/*
	 * The head descriptor should never be empty.  A new head is added only
	 * when adding an entry and the previous head is full, and heads are
	 * removed (this flow) when they become empty.
	 */
2023-07-28 17:47:22 -07:00
	KVM_BUG_ON_DATA_CORRUPTION(j < 0, kvm);
2023-01-13 20:29:10 +08:00
	/*
	 * Replace the to-be-freed SPTE with the last valid entry from the head
	 * descriptor to ensure that tail descriptors are full at all times.
	 * Note, this also means that tail_count is stable for each descriptor.
	 */
	desc->sptes[i] = head_desc->sptes[j];
	head_desc->sptes[j] = NULL;
	head_desc->spte_count--;
	if (head_desc->spte_count)
2007-01-05 16:36:38 -08:00
		return;
2023-01-13 20:29:10 +08:00
	/*
	 * The head descriptor is empty.  If there are no tail descriptors,
2024-01-02 18:40:11 -06:00
	 * nullify the rmap head to mark the list as empty, else point the rmap
2023-01-13 20:29:10 +08:00
	 * head at the next descriptor, i.e. the new head.
	 */
	if (!head_desc->more)
2019-12-05 11:40:16 +08:00
		rmap_head->val = 0;
2007-01-05 16:36:38 -08:00
	else
2023-01-13 20:29:10 +08:00
		rmap_head->val = (unsigned long)head_desc->more | 1;
	mmu_free_pte_list_desc(head_desc);
2007-01-05 16:36:38 -08:00
}
2023-07-28 17:47:21 -07:00
static void pte_list_remove ( struct kvm * kvm , u64 * spte ,
struct kvm_rmap_head * rmap_head )
2007-01-05 16:36:38 -08:00
{
2011-05-15 23:26:20 +08:00
struct pte_list_desc * desc ;
2007-01-05 16:36:38 -08:00
int i ;
2023-07-28 17:47:22 -07:00
	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
		return;
	if (!(rmap_head->val & 1)) {
		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
			return;
2015-11-20 17:41:28 +09:00
		rmap_head->val = 0;
2007-01-05 16:36:38 -08:00
	} else {
2015-11-20 17:41:28 +09:00
		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
2007-01-05 16:36:38 -08:00
		while (desc) {
2021-07-30 18:06:02 -04:00
			for (i = 0; i < desc->spte_count; ++i) {
2009-06-10 14:24:23 +03:00
				if (desc->sptes[i] == spte) {
2023-07-28 17:47:21 -07:00
					pte_list_desc_remove_entry(kvm, rmap_head,
								   desc, i);
2007-01-05 16:36:38 -08:00
					return;
				}
2015-11-20 17:41:28 +09:00
			}
2007-01-05 16:36:38 -08:00
			desc = desc->more;
		}
2023-07-28 17:47:22 -07:00
		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
2007-01-05 16:36:38 -08:00
}
}
2022-07-15 22:42:25 +00:00
static void kvm_zap_one_rmap_spte ( struct kvm * kvm ,
struct kvm_rmap_head * rmap_head , u64 * sptep )
2018-10-04 10:04:23 +08:00
{
2021-08-02 21:46:07 -07:00
mmu_spte_clear_track_bits ( kvm , sptep ) ;
2023-07-28 17:47:21 -07:00
pte_list_remove ( kvm , sptep , rmap_head ) ;
2018-10-04 10:04:23 +08:00
}
2022-07-15 22:42:25 +00:00
/* Return true if at least one SPTE was zapped, false otherwise */
static bool kvm_zap_all_rmap_sptes ( struct kvm * kvm ,
struct kvm_rmap_head * rmap_head )
2021-07-30 18:06:05 -04:00
{
struct pte_list_desc * desc , * next ;
int i ;
	if (!rmap_head->val)
		return false;
	if (!(rmap_head->val & 1)) {
2021-08-02 21:46:07 -07:00
		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
2021-07-30 18:06:05 -04:00
		goto out;
	}
	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
	for (; desc; desc = next) {
		for (i = 0; i < desc->spte_count; i++)
2021-08-02 21:46:07 -07:00
			mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
2021-07-30 18:06:05 -04:00
		next = desc->more;
		mmu_free_pte_list_desc(desc);
	}
out:
	/* rmap_head is meaningless now, remember to reset it */
	rmap_head->val = 0;
return true ;
}
2021-07-30 18:04:52 -04:00
unsigned int pte_list_count ( struct kvm_rmap_head * rmap_head )
{
struct pte_list_desc * desc ;
	if (!rmap_head->val)
		return 0;
	else if (!(rmap_head->val & 1))
		return 1;
	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
2023-01-13 20:29:10 +08:00
	return desc->tail_count + desc->spte_count;
2021-07-30 18:04:52 -04:00
}
2021-08-04 22:28:43 +00:00
static struct kvm_rmap_head * gfn_to_rmap ( gfn_t gfn , int level ,
const struct kvm_memory_slot * slot )
2011-05-15 23:26:20 +08:00
{
2012-07-02 17:57:17 +09:00
unsigned long idx ;
2011-05-15 23:26:20 +08:00
2012-07-02 17:57:17 +09:00
idx = gfn_to_index ( gfn , slot - > base_gfn , level ) ;
2020-04-27 17:54:22 -07:00
return & slot - > arch . rmap [ level - PG_LEVEL_4K ] [ idx ] ;
2011-05-15 23:26:20 +08:00
}
static void rmap_remove ( struct kvm * kvm , u64 * spte )
{
2021-08-04 22:28:42 +00:00
struct kvm_memslots * slots ;
struct kvm_memory_slot * slot ;
2011-05-15 23:26:20 +08:00
struct kvm_mmu_page * sp ;
gfn_t gfn ;
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * rmap_head ;
2011-05-15 23:26:20 +08:00
2020-06-22 13:20:33 -07:00
	sp = sptep_to_sp(spte);
2022-07-12 02:07:22 +00:00
	gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte));
2021-08-04 22:28:42 +00:00
	/*
2021-08-13 20:35:00 +00:00
	 * Unlike rmap_add, rmap_remove does not run in the context of a vCPU,
	 * so we have to determine which memslots to use based on context
	 * information in sp->role.
2021-08-04 22:28:42 +00:00
	 */
	slots = kvm_memslots_for_spte_role(kvm, sp->role);
	slot = __gfn_to_memslot(slots, gfn);
2021-08-04 22:28:43 +00:00
	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
2021-08-04 22:28:42 +00:00
2023-07-28 17:47:21 -07:00
	pte_list_remove(kvm, spte, rmap_head);
2011-05-15 23:26:20 +08:00
}
2012-03-21 23:50:34 +09:00
/*
 * Used by the following functions to iterate through the sptes linked by an
 * rmap.  All fields are private and not assumed to be used outside.
*/
struct rmap_iterator {
/* private fields */
	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
	int pos;			/* index of the sptep */
} ;
/*
 * Iteration must be started by this function.  This should also be used after
 * removing/dropping sptes from the rmap link because in such cases the
2019-12-06 16:20:18 +08:00
 * information in the iterator may not be valid.
2012-03-21 23:50:34 +09:00
 *
 * Returns sptep if found, NULL otherwise.
*/
2015-11-20 17:41:28 +09:00
static u64 * rmap_get_first ( struct kvm_rmap_head * rmap_head ,
struct rmap_iterator * iter )
2012-03-21 23:50:34 +09:00
{
2015-11-20 17:45:44 +09:00
u64 * sptep ;
2015-11-20 17:41:28 +09:00
	if (!rmap_head->val)
2012-03-21 23:50:34 +09:00
		return NULL;
2015-11-20 17:41:28 +09:00
	if (!(rmap_head->val & 1)) {
2012-03-21 23:50:34 +09:00
		iter->desc = NULL;
2015-11-20 17:45:44 +09:00
		sptep = (u64 *)rmap_head->val;
		goto out;
2012-03-21 23:50:34 +09:00
	}
2015-11-20 17:41:28 +09:00
	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
2012-03-21 23:50:34 +09:00
	iter->pos = 0;
2015-11-20 17:45:44 +09:00
	sptep = iter->desc->sptes[iter->pos];
out:
	BUG_ON(!is_shadow_present_pte(*sptep));
	return sptep;
2012-03-21 23:50:34 +09:00
}
/*
 * Must be used with a valid iterator: e.g. after rmap_get_first().
 *
 * Returns sptep if found, NULL otherwise.
*/
static u64 * rmap_get_next ( struct rmap_iterator * iter )
{
2015-11-20 17:45:44 +09:00
u64 * sptep ;
2012-03-21 23:50:34 +09:00
	if (iter->desc) {
		if (iter->pos < PTE_LIST_EXT - 1) {
			++iter->pos;
			sptep = iter->desc->sptes[iter->pos];
			if (sptep)
2015-11-20 17:45:44 +09:00
				goto out;
2012-03-21 23:50:34 +09:00
		}
		iter->desc = iter->desc->more;
		if (iter->desc) {
			iter->pos = 0;
			/* desc->sptes[0] cannot be NULL */
2015-11-20 17:45:44 +09:00
			sptep = iter->desc->sptes[iter->pos];
			goto out;
2012-03-21 23:50:34 +09:00
		}
	}
	return NULL;
2015-11-20 17:45:44 +09:00
out:
	BUG_ON(!is_shadow_present_pte(*sptep));
	return sptep;
2012-03-21 23:50:34 +09:00
}
2015-11-20 17:41:28 +09:00
#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_)			\
	for (_spte_ = rmap_get_first(_rmap_head_, _iter_);		\
2015-11-20 17:45:44 +09:00
	     _spte_; _spte_ = rmap_get_next(_iter_))
2015-05-13 14:42:20 +08:00
2011-07-12 03:28:04 +08:00
static void drop_spte ( struct kvm * kvm , u64 * sptep )
2010-07-16 11:28:09 +08:00
{
2021-08-02 21:46:07 -07:00
u64 old_spte = mmu_spte_clear_track_bits ( kvm , sptep ) ;
2021-07-02 15:04:51 -07:00
if ( is_shadow_present_pte ( old_spte ) )
2010-10-25 11:58:22 -02:00
rmap_remove ( kvm , sptep ) ;
2010-06-06 14:31:27 +03:00
}
2022-06-22 15:27:10 -04:00
static void drop_large_spte ( struct kvm * kvm , u64 * sptep , bool flush )
2012-06-20 15:57:39 +08:00
{
KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE. This is done by drop_large_spte().
However, dropping the large SPTE is really only necessary before the
sp is installed. While the sp is returned by kvm_mmu_get_child_sp(),
installing it happens later in __link_shadow_page(). Move the call
there instead of having it in each and every caller.
To ensure that the shadow page is not linked twice if it was present,
do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
instead, return an error value if the shadow page already existed.
This is a bit more verbose, but clearer than NULL.
Finally, now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:07 -04:00
struct kvm_mmu_page * sp ;
2012-06-20 15:57:39 +08:00
2022-06-22 15:27:07 -04:00
sp = sptep_to_sp ( sptep ) ;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( sp - > role . level = = PG_LEVEL_4K ) ;
2018-12-06 21:21:09 +08:00
2022-06-22 15:27:07 -04:00
drop_spte ( kvm , sptep ) ;
2022-06-22 15:27:10 -04:00
if ( flush )
2022-10-10 20:19:15 +08:00
kvm_flush_remote_tlbs_sptep ( kvm , sptep ) ;
2012-06-20 15:57:39 +08:00
}
/*
2012-06-20 15:58:58 +08:00
 * Write-protect the specified @sptep; @pt_protect indicates whether the
2014-04-17 17:06:14 +08:00
 * spte write-protection is caused by protecting a shadow page table.
2012-06-20 15:58:58 +08:00
 *
2014-09-22 10:31:38 +08:00
 * Note: write protection differs between dirty logging and spte
2012-06-20 15:58:58 +08:00
 * protection:
 * - for dirty logging, the spte can be set to writable at any time if
 *   its dirty bitmap is properly set.
 * - for spte protection, the spte can be made writable only after
 *   unsync-ing the shadow page.
2012-06-20 15:57:39 +08:00
 *
2014-04-17 17:06:14 +08:00
 * Return true if the TLB needs to be flushed.
2012-06-20 15:57:39 +08:00
*/
2016-08-02 16:32:37 -04:00
static bool spte_write_protect ( u64 * sptep , bool pt_protect )
2012-06-20 15:57:15 +08:00
{
u64 spte = * sptep ;
2012-06-20 15:58:58 +08:00
	if (!is_writable_pte(spte) &&
2022-04-23 03:47:41 +00:00
	    !(pt_protect && is_mmu_writable_spte(spte)))
2012-06-20 15:57:15 +08:00
		return false;
2012-06-20 15:58:58 +08:00
	if (pt_protect)
2021-02-25 12:47:43 -08:00
		spte &= ~shadow_mmu_writable_mask;
2012-06-20 15:57:15 +08:00
	spte = spte & ~PT_WRITABLE_MASK;
2012-06-20 15:58:58 +08:00
2014-04-17 17:06:14 +08:00
	return mmu_spte_update(sptep, spte);
2012-06-20 15:57:15 +08:00
}
2022-01-19 23:07:23 +00:00
static bool rmap_write_protect ( struct kvm_rmap_head * rmap_head ,
bool pt_protect )
2007-10-16 14:42:30 +02:00
{
2012-03-21 23:50:34 +09:00
u64 * sptep ;
struct rmap_iterator iter ;
2012-06-20 15:57:15 +08:00
bool flush = false ;
2007-01-05 16:36:43 -08:00
2015-11-20 17:41:28 +09:00
for_each_rmap_spte ( rmap_head , & iter , sptep )
2016-08-02 16:32:37 -04:00
flush | = spte_write_protect ( sptep , pt_protect ) ;
2008-03-20 18:17:24 +02:00
2012-06-20 15:57:15 +08:00
return flush ;
2012-03-01 19:31:22 +09:00
}
2016-08-02 16:32:37 -04:00
static bool spte_clear_dirty ( u64 * sptep )
2015-01-28 10:54:24 +08:00
{
u64 spte = * sptep ;
2023-07-28 17:47:16 -07:00
KVM_MMU_WARN_ON ( ! spte_ad_enabled ( spte ) ) ;
2015-01-28 10:54:24 +08:00
spte & = ~ shadow_dirty_mask ;
return mmu_spte_update ( sptep , spte ) ;
}
2019-09-26 18:47:59 +02:00
static bool spte_wrprot_for_clear_dirty ( u64 * sptep )
2017-06-30 17:26:31 -07:00
{
bool was_writable = test_and_clear_bit ( PT_WRITABLE_SHIFT ,
( unsigned long * ) sptep ) ;
2019-09-26 18:47:59 +02:00
if ( was_writable & & ! spte_ad_enabled ( * sptep ) )
2017-06-30 17:26:31 -07:00
kvm_set_pfn_dirty ( spte_to_pfn ( * sptep ) ) ;
return was_writable ;
}
/*
 * Gets the GFN ready for another round of dirty logging by clearing the
 *	- D bit on ad-enabled SPTEs, and
 *	- W bit on ad-disabled SPTEs.
 * Returns true iff any D or W bits were cleared.
*/
2021-02-12 16:50:05 -08:00
static bool __rmap_clear_dirty ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
2021-07-12 22:33:38 -04:00
const struct kvm_memory_slot * slot )
2015-01-28 10:54:24 +08:00
{
u64 * sptep ;
struct rmap_iterator iter ;
bool flush = false ;
2015-11-20 17:41:28 +09:00
for_each_rmap_spte ( rmap_head , & iter , sptep )
2019-09-26 18:47:59 +02:00
if ( spte_ad_need_write_protect ( * sptep ) )
flush | = spte_wrprot_for_clear_dirty ( sptep ) ;
2017-06-30 17:26:31 -07:00
else
2019-09-26 18:47:59 +02:00
flush | = spte_clear_dirty ( sptep ) ;
2015-01-28 10:54:24 +08:00
return flush ;
}
2012-03-01 19:32:16 +09:00
/**
2015-01-28 10:54:23 +08:00
 * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
2012-03-01 19:32:16 +09:00
 * @kvm: kvm instance
 * @slot: slot to protect
 * @gfn_offset: start of the BITS_PER_LONG pages we care about
 * @mask: indicates which pages we should protect
 *
2021-04-29 11:41:15 +08:00
 * Used when we do not need to care about huge page mappings.
2012-03-01 19:32:16 +09:00
*/
2015-01-28 10:54:23 +08:00
static void kvm_mmu_write_protect_pt_masked ( struct kvm * kvm ,
2012-03-01 19:32:16 +09:00
struct kvm_memory_slot * slot ,
gfn_t gfn_offset , unsigned long mask )
2012-03-01 19:31:22 +09:00
{
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * rmap_head ;
2012-03-01 19:31:22 +09:00
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2020-10-14 11:26:55 -07:00
kvm_tdp_mmu_clear_dirty_pt_masked ( kvm , slot ,
slot - > base_gfn + gfn_offset , mask , true ) ;
2021-05-18 10:34:13 -07:00
if ( ! kvm_memslots_have_rmaps ( kvm ) )
return ;
2012-03-01 19:32:16 +09:00
	while (mask) {
2021-08-04 22:28:43 +00:00
		rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
					PG_LEVEL_4K, slot);
2022-01-19 23:07:23 +00:00
		rmap_write_protect(rmap_head, false);
2008-02-23 11:44:30 -03:00
2012-03-01 19:32:16 +09:00
		/* clear the first set bit */
		mask &= mask - 1;
}
2007-01-05 16:36:43 -08:00
}
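The loop above visits one gfn per set bit of @mask by pairing __ffs() with `mask &= mask - 1`. The following is a small userspace sketch of that idiom, not kernel code: `sk_collect_set_bits()` is a hypothetical helper, and `__builtin_ctzl()` (a GCC/Clang builtin) is assumed here as a stand-in for the kernel's __ffs().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write the index of every set bit in mask to out[], lowest bit first.
 * Returns the number of indices written (at most cap).
 */
static size_t sk_collect_set_bits(unsigned long mask, unsigned int *out,
				  size_t cap)
{
	size_t n = 0;

	assert(out != NULL);
	while (mask && n < cap) {
		/* Index of the lowest set bit; mask is nonzero here. */
		out[n++] = (unsigned int)__builtin_ctzl(mask);
		/* Clear the lowest set bit. */
		mask &= mask - 1;
	}
	return n;
}
```

The loop runs once per set bit rather than once per bit position, so a sparse dirty mask costs only as many iterations as it has dirty pages.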
2015-01-28 10:54:24 +08:00
/**
2017-06-30 17:26:31 -07:00
 * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write
 * protect the page if the D-bit isn't supported.
2015-01-28 10:54:24 +08:00
 * @kvm: kvm instance
 * @slot: slot to clear D-bit
 * @gfn_offset: start of the BITS_PER_LONG pages we care about
 * @mask: indicates which pages we should clear D-bit
 *
 * Used for PML to re-log the dirty GPAs after userspace queries dirty_bitmap.
*/
2021-02-12 16:50:10 -08:00
static void kvm_mmu_clear_dirty_pt_masked ( struct kvm * kvm ,
struct kvm_memory_slot * slot ,
gfn_t gfn_offset , unsigned long mask )
2015-01-28 10:54:24 +08:00
{
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * rmap_head ;
2015-01-28 10:54:24 +08:00
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2020-10-14 11:26:55 -07:00
kvm_tdp_mmu_clear_dirty_pt_masked ( kvm , slot ,
slot - > base_gfn + gfn_offset , mask , false ) ;
2021-05-18 10:34:13 -07:00
if ( ! kvm_memslots_have_rmaps ( kvm ) )
return ;
2015-01-28 10:54:24 +08:00
while ( mask ) {
2021-08-04 22:28:43 +00:00
rmap_head = gfn_to_rmap ( slot - > base_gfn + gfn_offset + __ffs ( mask ) ,
PG_LEVEL_4K , slot ) ;
2021-02-12 16:50:05 -08:00
__rmap_clear_dirty ( kvm , rmap_head , slot ) ;
2015-01-28 10:54:24 +08:00
/* clear the first set bit */
mask & = mask - 1 ;
}
}
2015-01-28 10:54:23 +08:00
/**
 * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
 * PT level pages.
 *
 * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
 * enable dirty logging for them.
 *
2021-04-29 11:41:15 +08:00
 * We need to care about huge page mappings: e.g. during dirty logging we may
 * have such mappings.
2015-01-28 10:54:23 +08:00
*/
void kvm_arch_mmu_enable_log_dirty_pt_masked ( struct kvm * kvm ,
struct kvm_memory_slot * slot ,
gfn_t gfn_offset , unsigned long mask )
{
2021-04-29 11:41:15 +08:00
	/*
	 * Huge pages are NOT write protected when we start dirty logging in
	 * initially-all-set mode; must write protect them here so that they
	 * are split to 4K on the first write.
	 *
	 * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
	 * of memslot has no such restriction, so the range can cross two large
	 * pages.
	 */
	if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
2022-01-19 23:07:37 +00:00
		if (READ_ONCE(eager_page_split))
2023-10-27 10:26:38 -07:00
			kvm_mmu_try_split_huge_pages(kvm, slot, start, end + 1, PG_LEVEL_4K);
2022-01-19 23:07:37 +00:00
2021-04-29 11:41:15 +08:00
		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
		/* Cross two large pages? */
		if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
		    ALIGN(end << PAGE_SHIFT, PMD_SIZE))
			kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
						       PG_LEVEL_2M);
	}
	/* Now handle 4K PTEs. */
2021-02-12 16:50:10 -08:00
if ( kvm_x86_ops . cpu_dirty_log_size )
kvm_mmu_clear_dirty_pt_masked ( kvm , slot , gfn_offset , mask ) ;
2015-01-28 10:54:27 +08:00
else
kvm_mmu_write_protect_pt_masked ( kvm , slot , gfn_offset , mask ) ;
2015-01-28 10:54:23 +08:00
}
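The "Cross two large pages?" test above compares the 2M-rounded addresses of the first and last gfn in the mask. The following is a minimal userspace sketch of that arithmetic, not kernel code: `sk_align_up()` and `sk_needs_second_protect()` are hypothetical names, and the sketch assumes 4K base pages (shift 12), 2M PMD-sized pages, and a round-up ALIGN as in the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12			/* assumed 4K base pages */
#define SK_PMD_SIZE	(UINT64_C(1) << 21)	/* assumed 2M huge pages */

/* Round x up to the next multiple of a; a must be a power of two. */
static uint64_t sk_align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* True if start and end round up to different 2M boundaries. */
static bool sk_needs_second_protect(uint64_t start_gfn, uint64_t end_gfn)
{
	return sk_align_up(start_gfn << SK_PAGE_SHIFT, SK_PMD_SIZE) !=
	       sk_align_up(end_gfn << SK_PAGE_SHIFT, SK_PMD_SIZE);
}
```

For example, gfns 510 and 513 straddle the 2M boundary at gfn 512 and compare unequal, while gfns 100 and 163 both round up to the same boundary. Because the rounding is upward, a start gfn that sits exactly on a 2M boundary also compares unequal in this sketch, in which case the second write-protect would simply cover the same huge page again.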
2020-09-30 21:22:22 -04:00
int kvm_cpu_dirty_log_size ( void )
{
2021-02-12 16:50:09 -08:00
return kvm_x86_ops . cpu_dirty_log_size ;
2020-09-30 21:22:22 -04:00
}
2016-02-24 17:51:08 +08:00
bool kvm_mmu_slot_gfn_write_protect ( struct kvm * kvm ,
2021-04-29 11:41:14 +08:00
struct kvm_memory_slot * slot , u64 gfn ,
int min_level )
2011-11-14 18:24:50 +09:00
{
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * rmap_head ;
2012-03-01 19:32:16 +09:00
int i ;
2012-06-20 15:56:53 +08:00
bool write_protected = false ;
2011-11-14 18:24:50 +09:00
2021-05-18 10:34:13 -07:00
if ( kvm_memslots_have_rmaps ( kvm ) ) {
for ( i = min_level ; i < = KVM_MAX_HUGEPAGE_LEVEL ; + + i ) {
2021-08-04 22:28:43 +00:00
rmap_head = gfn_to_rmap ( gfn , i , slot ) ;
2022-01-19 23:07:23 +00:00
write_protected | = rmap_write_protect ( rmap_head , true ) ;
2021-05-18 10:34:13 -07:00
}
2012-03-01 19:32:16 +09:00
}
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2020-10-14 11:26:57 -07:00
write_protected | =
2021-04-29 11:41:14 +08:00
kvm_tdp_mmu_write_protect_gfn ( kvm , slot , gfn , min_level ) ;
2020-10-14 11:26:57 -07:00
2012-03-01 19:32:16 +09:00
return write_protected ;
2011-11-14 18:24:50 +09:00
}
2022-01-19 23:07:22 +00:00
static bool kvm_vcpu_write_protect_gfn ( struct kvm_vcpu * vcpu , u64 gfn )
2016-02-24 17:51:08 +08:00
{
struct kvm_memory_slot * slot ;
slot = kvm_vcpu_gfn_to_memslot ( vcpu , gfn ) ;
2021-04-29 11:41:14 +08:00
return kvm_mmu_slot_gfn_write_protect ( vcpu - > kvm , slot , gfn , PG_LEVEL_4K ) ;
2016-02-24 17:51:08 +08:00
}
2022-07-15 22:42:24 +00:00
static bool __kvm_zap_rmap ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
const struct kvm_memory_slot * slot )
2008-07-25 16:24:52 +02:00
{
2022-07-15 22:42:25 +00:00
return kvm_zap_all_rmap_sptes ( kvm , rmap_head ) ;
2015-05-13 14:42:25 +08:00
}
2022-07-15 22:42:24 +00:00
static bool kvm_zap_rmap ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
struct kvm_memory_slot * slot , gfn_t gfn , int level ,
pte_t unused )
2015-05-13 14:42:25 +08:00
{
2022-07-15 22:42:24 +00:00
return __kvm_zap_rmap ( kvm , rmap_head , slot ) ;
2008-07-25 16:24:52 +02:00
}
2022-07-15 22:42:22 +00:00
static bool kvm_set_pte_rmap ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
struct kvm_memory_slot * slot , gfn_t gfn , int level ,
pte_t pte )
2009-09-23 21:47:18 +03:00
{
2012-03-21 23:50:34 +09:00
u64 * sptep ;
struct rmap_iterator iter ;
2021-11-14 22:13:12 +05:30
bool need_flush = false ;
2012-03-21 23:50:34 +09:00
u64 new_spte ;
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 16:56:11 -08:00
kvm_pfn_t new_pfn ;
2009-09-23 21:47:18 +03:00
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( pte_huge ( pte ) ) ;
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 17:56:50 -07:00
new_pfn = pte_pfn ( pte ) ;
2012-03-21 23:50:34 +09:00
2015-05-13 14:42:20 +08:00
restart :
2015-11-20 17:41:28 +09:00
for_each_rmap_spte ( rmap_head , & iter , sptep ) {
2021-11-14 22:13:12 +05:30
need_flush = true ;
2012-03-21 23:50:34 +09:00
2021-04-01 17:56:50 -07:00
if ( pte_write ( pte ) ) {
2022-07-15 22:42:25 +00:00
kvm_zap_one_rmap_spte ( kvm , rmap_head , sptep ) ;
2015-05-13 14:42:20 +08:00
goto restart ;
2009-09-23 21:47:18 +03:00
} else {
2020-09-28 10:17:17 -04:00
new_spte = kvm_mmu_changed_pte_notifier_make_spte (
* sptep , new_pfn ) ;
2012-03-21 23:50:34 +09:00
2021-08-02 21:46:07 -07:00
mmu_spte_clear_track_bits ( kvm , sptep ) ;
2012-03-21 23:50:34 +09:00
mmu_spte_set ( sptep , new_spte ) ;
2009-09-23 21:47:18 +03:00
}
}
2012-03-21 23:50:34 +09:00
2023-04-04 17:31:32 -07:00
if (need_flush && kvm_available_flush_remote_tlbs_range()) {
2022-10-10 20:19:13 +08:00
kvm_flush_remote_tlbs_gfn ( kvm , gfn , level ) ;
2021-11-14 22:13:12 +05:30
return false ;
2018-12-06 21:21:12 +08:00
}
2018-12-06 21:21:11 +08:00
return need_flush ;
2009-09-23 21:47:18 +03:00
}
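The WARN_ON_ONCE() commit message above argues that a single splat is enough even when the assertion keeps failing. A minimal userspace sketch of that behaviour follows. It is not the kernel macro: `DEMO_WARN_ON_ONCE`, `demo_reports` and `demo_fault_storm` are hypothetical names, and the macro relies on a GNU C statement expression.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so a caller can observe how many splats were printed. */
static int demo_reports;

/*
 * Userspace stand-in for a warn-once assertion. Uses a GNU C statement
 * expression so the macro can both report and yield the condition's value.
 * Every expansion site gets its own static "already warned" flag.
 */
#define DEMO_WARN_ON_ONCE(cond) ({					\
	static bool warned_;						\
	bool failed_ = !!(cond);					\
	if (failed_ && !warned_) {					\
		warned_ = true;						\
		demo_reports++;						\
		fprintf(stderr, "WARN at %s:%d\n", __FILE__, __LINE__);	\
	}								\
	failed_;							\
})

/* Simulates a hot path that trips the same assertion on every iteration. */
static int demo_fault_storm(int nr_faults)
{
	int failures = 0;

	assert(nr_faults >= 0);
	for (int i = 0; i < nr_faults; i++) {
		/* The check still evaluates to true each time; only the log is throttled. */
		if (DEMO_WARN_ON_ONCE(i >= 0))
			failures++;
	}
	return failures;
}
```

Calling `demo_fault_storm(1000)` in this sketch sees the failed condition 1000 times but logs only once, so the caller can still take its recovery path on every failure.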
2015-05-13 14:42:22 +08:00
struct slot_rmap_walk_iterator {
/* input fields. */
2021-07-12 22:33:38 -04:00
const struct kvm_memory_slot * slot ;
2015-05-13 14:42:22 +08:00
gfn_t start_gfn ;
gfn_t end_gfn ;
int start_level ;
int end_level ;
/* output fields. */
gfn_t gfn ;
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * rmap ;
2015-05-13 14:42:22 +08:00
int level ;
/* private field. */
2015-11-20 17:41:28 +09:00
struct kvm_rmap_head * end_rmap ;
2015-05-13 14:42:22 +08:00
} ;
2023-02-02 18:27:51 +00:00
static void rmap_walk_init_level ( struct slot_rmap_walk_iterator * iterator ,
int level )
2015-05-13 14:42:22 +08:00
{
iterator->level = level;
iterator->gfn = iterator->start_gfn;
2021-08-04 22:28:43 +00:00
iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot);
iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
2015-05-13 14:42:22 +08:00
}
2023-02-02 18:27:51 +00:00
static void slot_rmap_walk_init ( struct slot_rmap_walk_iterator * iterator ,
const struct kvm_memory_slot * slot ,
int start_level , int end_level ,
gfn_t start_gfn , gfn_t end_gfn )
2015-05-13 14:42:22 +08:00
{
iterator->slot = slot;
iterator->start_level = start_level;
iterator->end_level = end_level;
iterator->start_gfn = start_gfn;
iterator->end_gfn = end_gfn;
rmap_walk_init_level(iterator, iterator->start_level);
}
static bool slot_rmap_walk_okay ( struct slot_rmap_walk_iterator * iterator )
{
return !!iterator->rmap;
}
static void slot_rmap_walk_next ( struct slot_rmap_walk_iterator * iterator )
{
2022-05-02 22:03:47 +00:00
while (++iterator->rmap <= iterator->end_rmap) {
2015-05-13 14:42:22 +08:00
iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
2022-05-02 22:03:47 +00:00
if (iterator->rmap->val)
return ;
2015-05-13 14:42:22 +08:00
}
if (++iterator->level > iterator->end_level) {
iterator->rmap = NULL;
return;
}
rmap_walk_init_level(iterator, iterator->level);
}
# define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \
_start_gfn , _end_gfn , _iter_ ) \
for ( slot_rmap_walk_init ( _iter_ , _slot_ , _start_level_ , \
_end_level_ , _start_gfn , _end_gfn ) ; \
slot_rmap_walk_okay ( _iter_ ) ; \
slot_rmap_walk_next ( _iter_ ) )
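The init/okay/next trio above wrapped in a for-loop macro can be modelled in userspace. The sketch below assumes a toy slot with two levels and eight entries per level. All `demo_*` names and the `DEMO_GFN_SHIFT` layout are hypothetical stand-ins, not KVM definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LEVELS	2	/* hypothetical: two mapping sizes */
#define DEMO_ENTRIES	8	/* hypothetical: entries per level */
/* Hypothetical: one level-1 entry covers 4 level-0 frames. */
#define DEMO_GFN_SHIFT(level)	((level) * 2)

struct demo_slot {
	unsigned long rmap[DEMO_LEVELS][DEMO_ENTRIES];	/* nonzero = has mappings */
};

struct demo_walk {
	/* input fields */
	const struct demo_slot *slot;
	unsigned long start_gfn, end_gfn;
	int start_level, end_level;
	/* output fields */
	unsigned long gfn;
	const unsigned long *rmap;
	int level;
	/* private field */
	const unsigned long *end_rmap;
};

static const unsigned long *demo_gfn_to_rmap(const struct demo_slot *slot,
					     unsigned long gfn, int level)
{
	return &slot->rmap[level][gfn >> DEMO_GFN_SHIFT(level)];
}

static void demo_walk_init_level(struct demo_walk *it, int level)
{
	it->level = level;
	it->gfn = it->start_gfn;
	it->rmap = demo_gfn_to_rmap(it->slot, it->gfn, level);
	it->end_rmap = demo_gfn_to_rmap(it->slot, it->end_gfn, level);
}

static void demo_walk_init(struct demo_walk *it, const struct demo_slot *slot,
			   int start_level, int end_level,
			   unsigned long start_gfn, unsigned long end_gfn)
{
	assert(start_level <= end_level && start_gfn <= end_gfn);
	it->slot = slot;
	it->start_level = start_level;
	it->end_level = end_level;
	it->start_gfn = start_gfn;
	it->end_gfn = end_gfn;
	demo_walk_init_level(it, start_level);
}

static void demo_walk_next(struct demo_walk *it)
{
	/* Skip empty entries within the current level. */
	while (++it->rmap <= it->end_rmap) {
		it->gfn += 1UL << DEMO_GFN_SHIFT(it->level);
		if (*it->rmap)
			return;
	}
	/* Level exhausted: move to the next level, or end the walk. */
	if (++it->level > it->end_level) {
		it->rmap = NULL;
		return;
	}
	demo_walk_init_level(it, it->level);
}

#define demo_for_each_rmap(slot, lo_level, hi_level, lo_gfn, hi_gfn, it)	\
	for (demo_walk_init(it, slot, lo_level, hi_level, lo_gfn, hi_gfn);	\
	     (it)->rmap != NULL;						\
	     demo_walk_next(it))

/* Counts non-empty entries covering [start_gfn, end_gfn] across all levels. */
static int demo_count_nonempty(const struct demo_slot *slot,
			       unsigned long start_gfn, unsigned long end_gfn)
{
	struct demo_walk it;
	int count = 0;

	demo_for_each_rmap(slot, 0, DEMO_LEVELS - 1, start_gfn, end_gfn, &it)
		if (*it.rmap)	/* the first entry of a level may be empty */
			count++;
	return count;
}
```

In this sketch the first entry of each level is handed to the loop body without an emptiness check, while later entries are skipped when empty. That is why the body tests `*it.rmap` itself.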
2021-04-01 17:56:50 -07:00
typedef bool ( * rmap_handler_t ) ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
struct kvm_memory_slot * slot , gfn_t gfn ,
int level , pte_t pte ) ;
2021-02-25 17:03:28 -08:00
2021-04-01 17:56:50 -07:00
static __always_inline bool kvm_handle_gfn_range ( struct kvm * kvm ,
struct kvm_gfn_range * range ,
rmap_handler_t handler )
2008-07-25 16:24:52 +02:00
{
2015-05-13 14:42:22 +08:00
struct slot_rmap_walk_iterator iterator ;
2021-04-01 17:56:50 -07:00
bool ret = false ;
2008-07-25 16:24:52 +02:00
2021-04-01 17:56:50 -07:00
for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
range->start, range->end - 1, &iterator)
ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
2023-07-28 17:41:44 -07:00
iterator.level, range->arg.pte);
2008-07-25 16:24:52 +02:00
2012-07-02 17:58:48 +09:00
return ret ;
2008-07-25 16:24:52 +02:00
}
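kvm_handle_gfn_range() applies one callback to every rmap in the range and ORs the results together. A minimal sketch of that dispatch shape, using hypothetical `demo_*` names and a plain array in place of rmaps:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler type: returns true if the entry needed attention. */
typedef bool (*demo_handler_t)(unsigned long *entry);

/* Clears the entry and reports whether there was anything to clear. */
static bool demo_zap(unsigned long *entry)
{
	bool had_mapping = *entry != 0;

	*entry = 0;
	return had_mapping;
}

/* Read-only probe: reports whether the entry is populated. */
static bool demo_test(unsigned long *entry)
{
	return *entry != 0;
}

/*
 * Applies the handler to every entry in [start, end). Using |= instead of
 * || means every entry is visited even after one handler returned true.
 */
static inline bool demo_handle_range(unsigned long *entries, size_t start,
				     size_t end, demo_handler_t handler)
{
	bool ret = false;

	assert(handler != NULL && start <= end);
	for (size_t i = start; i < end; i++)
		ret |= handler(&entries[i]);
	return ret;
}
```

The choice of `|=` matters for handlers with side effects such as `demo_zap`: a short-circuiting `||` would stop modifying entries after the first hit.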
2021-04-01 17:56:50 -07:00
bool kvm_unmap_gfn_range ( struct kvm * kvm , struct kvm_gfn_range * range )
KVM: MMU: Make kvm_handle_hva() handle range of addresses
When guest's memory is backed by THP pages, MMU notifier needs to call
kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
invalidate a range of pages which constitute one huge page:
for each page
for each memslot
if page is in memslot
unmap using rmap
This means although every page in that range is expected to be found in
the same memslot, we are forced to check unrelated memslots many times.
If the guest has more memslots, the situation will become worse.
Furthermore, if the range does not include any pages in the guest's
memory, the loop over the pages will just consume extra time.
This patch, together with the following patches, solves this problem by
introducing kvm_handle_hva_range() which makes the loop look like this:
for each memslot
for each page in memslot
unmap using rmap
In this new processing, the actual work is converted to a loop over rmap
which is much more cache friendly than before.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-02 17:55:48 +09:00
{
2021-05-18 10:34:13 -07:00
bool flush = false ;
2020-10-14 11:26:52 -07:00
2021-05-18 10:34:13 -07:00
if ( kvm_memslots_have_rmaps ( kvm ) )
2022-07-15 22:42:24 +00:00
flush = kvm_handle_gfn_range ( kvm , range , kvm_zap_rmap ) ;
2020-10-14 11:26:52 -07:00
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2021-11-17 17:20:39 +08:00
flush = kvm_tdp_mmu_unmap_gfn_range ( kvm , range , flush ) ;
2020-10-14 11:26:52 -07:00
2023-06-01 18:15:18 -07:00
if (kvm_x86_ops.set_apic_access_page_addr &&
range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT)
2023-06-01 18:15:17 -07:00
kvm_make_all_cpus_request ( kvm , KVM_REQ_APIC_PAGE_RELOAD ) ;
2021-04-01 17:56:50 -07:00
return flush ;
2012-07-02 17:56:33 +09:00
}
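The kvm_handle_hva_range() commit message above describes swapping the loop nesting from "for each page, for each memslot" to "for each memslot, for each page in memslot". The sketch below counts memslot checks under both shapes. The `demo_*` names and the simplified memslot layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified memslot: a contiguous run of page frames. */
struct demo_memslot {
	unsigned long base;
	unsigned long npages;
};

/* Old shape: every page in the range is checked against every memslot. */
static unsigned long demo_unmap_per_page(const struct demo_memslot *slots,
					 size_t nr_slots, unsigned long start,
					 unsigned long end, unsigned long *checks)
{
	unsigned long unmapped = 0;

	assert(start <= end);
	for (unsigned long pfn = start; pfn < end; pfn++) {
		for (size_t i = 0; i < nr_slots; i++) {
			(*checks)++;
			if (pfn >= slots[i].base &&
			    pfn < slots[i].base + slots[i].npages)
				unmapped++;
		}
	}
	return unmapped;
}

/* New shape: each memslot is checked once, then only its overlap is walked. */
static unsigned long demo_unmap_per_slot(const struct demo_memslot *slots,
					 size_t nr_slots, unsigned long start,
					 unsigned long end, unsigned long *checks)
{
	unsigned long unmapped = 0;

	assert(start <= end);
	for (size_t i = 0; i < nr_slots; i++) {
		unsigned long slot_end = slots[i].base + slots[i].npages;
		unsigned long lo = start > slots[i].base ? start : slots[i].base;
		unsigned long hi = end < slot_end ? end : slot_end;

		(*checks)++;
		for (unsigned long pfn = lo; pfn < hi; pfn++)
			unmapped++;
	}
	return unmapped;
}
```

With three memslots and a 512-page range that lies inside one of them, the per-page shape performs 1536 memslot checks and the per-slot shape performs 3, while both unmap the same 512 pages.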
2021-04-01 17:56:50 -07:00
bool kvm_set_spte_gfn ( struct kvm * kvm , struct kvm_gfn_range * range )
2009-09-23 21:47:18 +03:00
{
2021-05-18 10:34:13 -07:00
bool flush = false ;
2020-10-14 11:26:54 -07:00
2021-05-18 10:34:13 -07:00
if ( kvm_memslots_have_rmaps ( kvm ) )
2022-07-15 22:42:22 +00:00
flush = kvm_handle_gfn_range ( kvm , range , kvm_set_pte_rmap ) ;
2020-10-14 11:26:54 -07:00
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2021-04-01 17:56:50 -07:00
flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
2020-10-14 11:26:54 -07:00
2021-04-01 17:56:50 -07:00
return flush ;
2008-07-25 16:24:52 +02:00
}
2022-07-15 22:42:22 +00:00
static bool kvm_age_rmap ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
struct kvm_memory_slot * slot , gfn_t gfn , int level ,
pte_t unused )
2008-07-25 16:24:52 +02:00
{
2012-03-21 23:50:34 +09:00
u64 * sptep ;
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-03 13:09:38 -07:00
struct rmap_iterator iter ;
2008-07-25 16:24:52 +02:00
int young = 0 ;
2016-12-06 16:46:16 -08:00
for_each_rmap_spte ( rmap_head , & iter , sptep )
young |= mmu_spte_age(sptep);
2015-05-13 14:42:20 +08:00
2008-07-25 16:24:52 +02:00
return young ;
}
2022-07-15 22:42:22 +00:00
static bool kvm_test_age_rmap ( struct kvm * kvm , struct kvm_rmap_head * rmap_head ,
struct kvm_memory_slot * slot , gfn_t gfn ,
int level , pte_t unused )
2011-01-13 15:47:10 -08:00
{
2012-03-21 23:50:34 +09:00
u64 * sptep ;
struct rmap_iterator iter ;
2011-01-13 15:47:10 -08:00
2016-12-06 16:46:13 -08:00
for_each_rmap_spte ( rmap_head , & iter , sptep )
if ( is_accessed_spte ( * sptep ) )
2021-11-14 22:13:12 +05:30
return true ;
return false ;
2011-01-13 15:47:10 -08:00
}
2009-08-05 15:43:58 -03:00
# define RMAP_RECYCLE_THRESHOLD 1000
2022-06-22 15:27:02 -04:00
static void __rmap_add ( struct kvm * kvm ,
struct kvm_mmu_memory_cache * cache ,
const struct kvm_memory_slot * slot ,
2022-06-24 17:18:07 +00:00
u64 * spte , gfn_t gfn , unsigned int access )
2009-08-05 15:43:58 -03:00
{
2009-07-27 16:30:44 +02:00
struct kvm_mmu_page * sp ;
2021-08-13 20:35:00 +00:00
struct kvm_rmap_head * rmap_head ;
int rmap_count ;
2009-07-27 16:30:44 +02:00
2020-06-22 13:20:33 -07:00
sp = sptep_to_sp ( spte ) ;
2022-07-12 02:07:22 +00:00
kvm_mmu_page_set_translation ( sp , spte_index ( spte ) , gfn , access ) ;
2022-06-22 15:27:03 -04:00
kvm_update_page_stats(kvm, sp->role.level, 1);
2021-08-04 22:28:43 +00:00
rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
2022-06-22 15:27:02 -04:00
rmap_count = pte_list_add ( cache , spte , rmap_head ) ;
2009-08-05 15:43:58 -03:00
2022-09-07 16:06:57 +08:00
if (rmap_count > kvm->stat.max_mmu_rmap_size)
kvm->stat.max_mmu_rmap_size = rmap_count;
2021-08-13 20:35:00 +00:00
if ( rmap_count > RMAP_RECYCLE_THRESHOLD ) {
2022-07-15 22:42:25 +00:00
kvm_zap_all_rmap_sptes ( kvm , rmap_head ) ;
2022-10-10 20:19:15 +08:00
kvm_flush_remote_tlbs_gfn(kvm, gfn, sp->role.level);
2021-08-13 20:35:00 +00:00
}
2009-08-05 15:43:58 -03:00
}
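__rmap_add() tracks the largest rmap chain seen and, once a chain grows past RMAP_RECYCLE_THRESHOLD, zaps the whole chain. A small model of that bookkeeping follows. It uses hypothetical `demo_*` names and a deliberately tiny threshold, where the source uses 1000. It also assumes the add helper returns the chain length after insertion, which may differ from what pte_list_add() reports.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RECYCLE_THRESHOLD	4	/* hypothetical; the source uses 1000 */

/* Hypothetical stand-in for an rmap chain: only its length is modelled. */
struct demo_rmap {
	int nr_entries;
};

struct demo_stats {
	int max_rmap_size;	/* high-water mark across all adds */
	int recycles;		/* how many times a chain was thrown away */
};

/* Assumption: returns the chain length after the new entry was added. */
static int demo_list_add(struct demo_rmap *rmap)
{
	return ++rmap->nr_entries;
}

static void demo_rmap_add(struct demo_rmap *rmap, struct demo_stats *stats)
{
	int count;

	assert(rmap != NULL && stats != NULL);
	count = demo_list_add(rmap);

	if (count > stats->max_rmap_size)
		stats->max_rmap_size = count;

	/* An overlong chain is dropped wholesale instead of being trimmed. */
	if (count > DEMO_RECYCLE_THRESHOLD) {
		rmap->nr_entries = 0;
		stats->recycles++;
	}
}
```

In this model the fifth add crosses the threshold of 4, so the chain is emptied and the next add starts again from a length of 1. The high-water mark keeps the value 5.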
2022-06-22 15:27:02 -04:00
static void rmap_add ( struct kvm_vcpu * vcpu , const struct kvm_memory_slot * slot ,
2022-06-24 17:18:07 +00:00
u64 * spte , gfn_t gfn , unsigned int access )
2022-06-22 15:27:02 -04:00
{
struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;
KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.
In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.
While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:04 -04:00
__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
2022-06-22 15:27:02 -04:00
}
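The "Cache the access bits" commit message above notes that the access bits need only 3 bits of a 64-bit entry, leaving 61 bits for the gfn. The sketch below shows one way such packing could look. The bit layout (access in the low 3 bits) and all `demo_*` names are assumptions for illustration and may not match how kvm_mmu_page_set_translation() actually encodes the value.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ACCESS_BITS	3
#define DEMO_ACCESS_MASK	((UINT64_C(1) << DEMO_ACCESS_BITS) - 1)
#define DEMO_GFN_BITS		(64 - DEMO_ACCESS_BITS)	/* 61 bits remain */

/* Packs a gfn and its access bits into one 64-bit cached translation. */
static uint64_t demo_pack_translation(uint64_t gfn, unsigned int access)
{
	/* The gfn must fit in the bits left over after the access field. */
	assert((gfn >> DEMO_GFN_BITS) == 0);
	assert((access & ~DEMO_ACCESS_MASK) == 0);
	return (gfn << DEMO_ACCESS_BITS) | access;
}

static uint64_t demo_translation_gfn(uint64_t translation)
{
	return translation >> DEMO_ACCESS_BITS;
}

static unsigned int demo_translation_access(uint64_t translation)
{
	return (unsigned int)(translation & DEMO_ACCESS_MASK);
}
```

Both fields round-trip unchanged as long as the gfn stays below 2^61, which is the headroom the commit message calls more than enough.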
2021-04-01 17:56:50 -07:00
bool kvm_age_gfn ( struct kvm * kvm , struct kvm_gfn_range * range )
2008-07-25 16:24:52 +02:00
{
2021-05-18 10:34:13 -07:00
bool young = false ;
2021-04-01 17:56:50 -07:00
2021-05-18 10:34:13 -07:00
if ( kvm_memslots_have_rmaps ( kvm ) )
2022-07-15 22:42:22 +00:00
young = kvm_handle_gfn_range ( kvm , range , kvm_age_rmap ) ;
2020-10-14 11:26:53 -07:00
2022-09-21 10:35:37 -07:00
if ( tdp_mmu_enabled )
2021-04-01 17:56:50 -07:00
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
2020-10-14 11:26:53 -07:00
return young ;
2008-07-25 16:24:52 +02:00
}
2021-04-01 17:56:50 -07:00
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
2011-01-13 15:47:10 -08:00
{
2021-05-18 10:34:13 -07:00
	bool young = false;
2021-04-01 17:56:50 -07:00
2021-05-18 10:34:13 -07:00
	if (kvm_memslots_have_rmaps(kvm))
2022-07-15 22:42:22 +00:00
		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
2020-10-14 11:26:53 -07:00
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled)
2021-04-01 17:56:50 -07:00
		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
2020-10-14 11:26:53 -07:00
	return young;
2011-01-13 15:47:10 -08:00
}
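The hva->gfn lookup that the commit message above describes moving into common code can be illustrated with a small userspace sketch. This is not the kernel's implementation: the `sketch_` types and helpers are hypothetical names, a 4 KiB page size is assumed, and only a single memslot is considered. It clamps a host virtual address range to the slot and converts the overlap into a half-open gfn range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical, simplified stand-in for a memslot. */
struct sketch_memslot {
	uint64_t base_gfn;		/* first guest frame number in the slot */
	uint64_t npages;		/* slot size in pages */
	uint64_t userspace_addr;	/* host virtual address backing the slot */
};

/* Half-open range of guest frame numbers: [start, end). */
struct sketch_gfn_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t sketch_hva_to_gfn(uint64_t hva, const struct sketch_memslot *slot)
{
	assert(hva >= slot->userspace_addr);
	return slot->base_gfn + ((hva - slot->userspace_addr) >> SKETCH_PAGE_SHIFT);
}

/*
 * Clamp [hva_start, hva_end) to the slot and translate the overlap.
 * Returns false when the range does not intersect the slot, in which
 * case a caller could skip the slot without taking any lock.
 */
static bool sketch_hva_range_to_gfn_range(const struct sketch_memslot *slot,
					   uint64_t hva_start, uint64_t hva_end,
					   struct sketch_gfn_range *out)
{
	uint64_t slot_end = slot->userspace_addr +
			    (slot->npages << SKETCH_PAGE_SHIFT);
	uint64_t start = hva_start > slot->userspace_addr ?
			 hva_start : slot->userspace_addr;
	uint64_t end = hva_end < slot_end ? hva_end : slot_end;

	if (start >= end)
		return false;

	out->start = sketch_hva_to_gfn(start, slot);
	/* Round the end up so a partially covered last page is included. */
	out->end = sketch_hva_to_gfn(end + SKETCH_PAGE_SIZE - 1, slot);
	return true;
}
```

A generic walker would presumably run this kind of translation once per memslot and pass the resulting gfn range to an arch handler, which is what lets both the legacy MMU and the TDP MMU share a single walk.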
2023-07-28 17:47:15 -07:00
static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
2006-12-10 02:21:36 -08:00
{
2023-07-28 17:47:19 -07:00
#ifdef CONFIG_KVM_PROVE_MMU
2023-07-28 17:47:14 -07:00
	int i;
2007-01-05 16:36:50 -08:00
2023-07-28 17:47:14 -07:00
	for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
2023-07-28 17:47:16 -07:00
		if (KVM_MMU_WARN_ON(is_shadow_present_pte(sp->spt[i])))
2023-07-28 17:47:15 -07:00
			pr_err_ratelimited("SPTE %llx (@ %p) for gfn %llx shadow-present at free",
					   sp->spt[i], &sp->spt[i],
					   kvm_mmu_page_get_gfn(sp, i));
2023-07-28 17:47:14 -07:00
	}
2007-04-25 14:17:25 +08:00
#endif
2023-07-28 17:47:15 -07:00
}
2006-12-10 02:21:36 -08:00
KVM: create aggregate kvm_total_used_mmu_pages value
Of slab shrinkers, the VM code says:
* Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
* querying the cache size, so a fastpath for that case is appropriate.
and it *means* it. Look at how it calls the shrinkers:
nr_before = (*shrinker->shrink)(0, gfp_mask);
shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
So, if you do anything stupid in your shrinker, the VM will doubly
punish you.
The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm->mmu_lock in sequence. If we have 100 VMs, then
we're going to take 101 locks. We do it twice, so each call takes
202 locks. If we're under memory pressure, we can have each cpu
trying to do this. It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.
This is guaranteed to optimize at least half of those lock
acquisitions away. It removes the need to take any of the locks
when simply trying to count objects.
A 'percpu_counter' can be a large object, but we only have one
of these for the entire system. There are not any better
alternatives at the moment, especially ones that handle CPU
hotplug.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-19 18:11:37 -07:00
/*
 * This value is the sum of all of the kvm instances'
 * kvm->arch.n_used_mmu_pages values.  We need a global,
 * aggregate version in order to make the slab shrinker
 * faster
 */
2021-08-04 14:46:09 -07:00
static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
2010-08-19 18:11:37 -07:00
{
	kvm->arch.n_used_mmu_pages += nr;
	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
}
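The idea behind the aggregate counter, keeping a global total in step with each per-VM count so that the shrinker's "count objects" query needs no locks, can be sketched in userspace C. This is only an illustration under stated assumptions: a C11 `atomic_long` is swapped in for the kernel's `percpu_counter`, and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-VM state. */
struct sketch_vm {
	long n_used_mmu_pages;	/* assumed to be protected by a per-VM lock */
};

/* Global aggregate; a plain atomic replaces the kernel's percpu_counter. */
static atomic_long sketch_total_used_mmu_pages;

/* Adjust the per-VM count and the global total together. */
static void sketch_mod_used_mmu_pages(struct sketch_vm *vm, long nr)
{
	assert(vm->n_used_mmu_pages + nr >= 0);
	vm->n_used_mmu_pages += nr;
	atomic_fetch_add(&sketch_total_used_mmu_pages, nr);
}

/*
 * The "count objects" query reads only the aggregate, so it does not
 * have to walk the VM list or take any per-VM lock.
 */
static long sketch_count_objects(void)
{
	long total = atomic_load(&sketch_total_used_mmu_pages);

	return total > 0 ? total : 0;
}
```

The trade-off is an extra update on every page allocation and free in exchange for a constant-time, lock-free count, which matters when the count query runs on every CPU under memory pressure.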
2022-08-23 00:46:37 +00:00
static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	kvm_mod_used_mmu_pages(kvm, +1);
	kvm_account_pgtable_pages((void *)sp->spt, +1);
}

static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	kvm_mod_used_mmu_pages(kvm, -1);
	kvm_account_pgtable_pages((void *)sp->spt, -1);
}
2022-06-22 15:26:55 -04:00
static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
2007-01-05 16:36:49 -08:00
{
2023-07-28 17:47:15 -07:00
	kvm_mmu_check_sptes_at_free(sp);
2010-06-04 21:53:54 +08:00
	hlist_del(&sp->hash_link);
2011-07-12 03:27:14 +08:00
	list_del(&sp->link);
	free_page((unsigned long)sp->spt);
2013-01-30 16:45:05 +02:00
	if (!sp->role.direct)
KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.
In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.
While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:04 -04:00
		free_page((unsigned long)sp->shadowed_translation);
2010-05-13 10:06:02 +08:00
	kmem_cache_free(mmu_page_header_cache, sp);
2007-01-05 16:36:49 -08:00
}
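The commit message above notes that the access bits take only 3 bits, leaving 61 bits for the gfn in each 64-bit entry. A minimal sketch of such a packing follows. The bit layout and all `sketch_` names are assumptions made for illustration; the layout the kernel actually uses for `shadowed_translation` may differ.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ACC_BITS 3	/* assumption: access bits sit in the low 3 bits */
#define SKETCH_ACC_MASK ((1ULL << SKETCH_ACC_BITS) - 1)

/* Pack a gfn and its guest access bits into one 64-bit entry. */
static uint64_t sketch_pack_translation(uint64_t gfn, unsigned int access)
{
	assert(access <= SKETCH_ACC_MASK);
	assert(gfn < (1ULL << (64 - SKETCH_ACC_BITS)));
	return (gfn << SKETCH_ACC_BITS) | access;
}

/* Recover the gfn from a packed entry. */
static uint64_t sketch_translation_gfn(uint64_t entry)
{
	return entry >> SKETCH_ACC_BITS;
}

/* Recover the cached access bits from a packed entry. */
static unsigned int sketch_translation_access(uint64_t entry)
{
	return (unsigned int)(entry & SKETCH_ACC_MASK);
}
```

Caching the access bits next to the gfn is what lets eager page splitting choose a role for the new subpages without a vCPU to walk the guest page tables.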
[PATCH] KVM: MMU: Shadow page table caching
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.
The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
* we can cache real mode, 32-bit mode, pae, and long mode page
tables simultaneously. this is useful for smp bootup.
- the guest page table level
* some kernels use a page as both a page table and a page directory. this
allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
* 32-bit mode page tables span 4MB, whereas a shadow page table spans
2MB. similarly, a 32-bit page directory spans 4GB, while a shadow
page directory spans 1GB. the quadrant allows caching up to 4 shadow page
tables for one guest page in one level.
- a "metaphysical" bit
* for real mode, and for pse pages, there is no guest page table, so set
the bit to avoid write protecting the page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 16:36:43 -08:00
static unsigned kvm_page_table_hashfn(gfn_t gfn)
{
kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs) becomes imbalanced, with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadowed. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
hash max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits 49847
hash_64 105
perfect 97
While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).
Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6c5 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").
Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 13:58:25 -08:00
	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
2007-01-05 16:36:43 -08:00
}
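The collision problem described in the hash commit message can be reproduced with a small userspace experiment. The sketch below makes several assumptions: it uses the old 10-bit table size, the `sketch_` helpers are hypothetical, and the multiplicative hash is a userspace imitation of `hash_64()` rather than the kernel function itself. It counts how many distinct buckets are hit by gfns aligned to 512 pages (2MB), as seen for level-1 tables under two-dimensional paging.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_SHIFT	10	/* assumption: the old table size */
#define SKETCH_NR_BUCKETS	(1u << SKETCH_HASH_SHIFT)
#define SKETCH_GOLDEN_RATIO_64	0x61C8864680B583EBull

/* Old scheme: index by the low bits of the gfn. */
static unsigned int sketch_hash_low_bits(uint64_t gfn)
{
	return (unsigned int)(gfn & (SKETCH_NR_BUCKETS - 1));
}

/* Multiplicative hash in the style of hash_64(): keep the top bits. */
static unsigned int sketch_hash_mult(uint64_t gfn)
{
	return (unsigned int)((gfn * SKETCH_GOLDEN_RATIO_64) >>
			      (64 - SKETCH_HASH_SHIFT));
}

/* Count the distinct buckets hit by gfns 0, stride, 2 * stride, ... */
static unsigned int sketch_buckets_used(unsigned int (*hash)(uint64_t),
					uint64_t stride, unsigned int nr)
{
	unsigned char used[SKETCH_NR_BUCKETS];
	unsigned int i, count = 0;

	memset(used, 0, sizeof(used));
	for (i = 0; i < nr; i++) {
		unsigned int bucket = hash((uint64_t)i * stride);

		assert(bucket < SKETCH_NR_BUCKETS);
		if (!used[bucket]) {
			used[bucket] = 1;
			count++;
		}
	}
	return count;
}
```

Calling `sketch_buckets_used(sketch_hash_low_bits, 512, 1024)` touches only buckets 0 and 512, matching the imbalance the commit message reports, while the multiplicative variant spreads the same gfns over far more buckets.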
2022-06-22 15:27:02 -04:00
static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
2007-11-21 15:28:32 +02:00
				    struct kvm_mmu_page *sp, u64 *parent_pte)
2007-01-05 16:36:43 -08:00
{
	if (!parent_pte)
		return;
2022-06-22 15:27:02 -04:00
	pte_list_add(cache, parent_pte, &sp->parent_ptes);
2007-01-05 16:36:43 -08:00
}
2023-07-28 17:47:21 -07:00
static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
2007-01-05 16:36:43 -08:00
				       u64 *parent_pte)
{
2023-07-28 17:47:21 -07:00
	pte_list_remove(kvm, parent_pte, &sp->parent_ptes);
2007-01-05 16:36:43 -08:00
}
2023-07-28 17:47:21 -07:00
static void drop_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
2011-05-15 23:28:29 +08:00
			    u64 *parent_pte)
{
2023-07-28 17:47:21 -07:00
	mmu_page_remove_parent_pte(kvm, sp, parent_pte);
2011-07-12 03:30:35 +08:00
	mmu_spte_clear_no_track(parent_pte);
2011-05-15 23:28:29 +08:00
}
2011-05-15 23:27:08 +08:00
static void mark_unsync(u64 *spte);
2010-06-11 21:35:15 +08:00
static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
2008-09-23 13:18:40 -03:00
{
2015-11-26 21:15:38 +09:00
	u64 *sptep;
	struct rmap_iterator iter;
	for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
		mark_unsync(sptep);
	}
2008-09-23 13:18:40 -03:00
}
2011-05-15 23:27:08 +08:00
static void mark_unsync(u64 *spte)
2008-09-23 13:18:40 -03:00
{
2011-05-15 23:27:08 +08:00
	struct kvm_mmu_page *sp;
2008-09-23 13:18:40 -03:00
2020-06-22 13:20:33 -07:00
	sp = sptep_to_sp(spte);
2022-07-12 02:07:22 +00:00
	if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
2008-09-23 13:18:40 -03:00
		return;
2010-06-11 21:35:15 +08:00
	if (sp->unsync_children++)
2008-09-23 13:18:40 -03:00
		return;
2010-06-11 21:35:15 +08:00
	kvm_mmu_mark_parents_unsync(sp);
2008-09-23 13:18:40 -03:00
}
2008-12-01 22:32:02 -02:00
#define KVM_PAGE_ARRAY_NR 16
struct kvm_mmu_pages {
	struct mmu_page_and_offset {
		struct kvm_mmu_page *sp;
		unsigned int idx;
	} page[KVM_PAGE_ARRAY_NR];
	unsigned int nr;
};
2009-02-21 02:19:13 +01:00
static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
			 int idx)
2008-09-23 13:18:39 -03:00
{
2008-12-01 22:32:02 -02:00
	int i;
2008-09-23 13:18:39 -03:00
2008-12-01 22:32:02 -02:00
	if (sp->unsync)
		for (i = 0; i < pvec->nr; i++)
			if (pvec->page[i].sp == sp)
				return 0;
	pvec->page[pvec->nr].sp = sp;
	pvec->page[pvec->nr].idx = idx;
	pvec->nr++;
	return (pvec->nr == KVM_PAGE_ARRAY_NR);
}
2015-11-20 17:43:13 +09:00
static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
{
	--sp->unsync_children;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyway.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE((int)sp->unsync_children < 0);
2015-11-20 17:43:13 +09:00
	__clear_bit(idx, sp->unsync_child_bitmap);
}
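The warn-once behaviour argued for in the commit message above can be imitated in userspace. The sketch below is not the kernel's `WARN_ON_ONCE()` macro. It is a hypothetical helper that keeps per-call-site state in an explicit struct, so a condition that keeps failing is still reported to the caller every time but is logged only once.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-call-site state; the kernel hides this in the macro. */
struct sketch_warn_site {
	bool warned;		/* set once the first message has been logged */
	unsigned long hits;	/* how many times the condition was true */
	unsigned long printed;	/* how many messages were actually logged */
};

/*
 * Returns the condition unchanged so it can be used inside an if (),
 * but logs at most one message per site however often it fires.
 */
static bool sketch_warn_on_once(struct sketch_warn_site *site, bool cond,
				const char *what)
{
	if (cond) {
		site->hits++;
		if (!site->warned) {
			site->warned = true;
			site->printed++;
			fprintf(stderr, "WARNING: %s\n", what);
		}
	}
	return cond;
}
```

A caller would keep one `struct sketch_warn_site` per assertion. The `hits` field is an addition for illustration, showing that failures can still be counted cheaply while the log stays quiet.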
2008-12-01 22:32:02 -02:00
static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
			     struct kvm_mmu_pages *pvec)
{
	int i, ret, nr_unsync_leaf = 0;
2008-09-23 13:18:39 -03:00
2011-11-29 14:02:45 +09:00
	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
2010-06-11 21:34:04 +08:00
		struct kvm_mmu_page *child;
2008-09-23 13:18:39 -03:00
		u64 ent = sp->spt[i];
2015-11-20 17:43:13 +09:00
		if (!is_shadow_present_pte(ent) || is_large_pte(ent)) {
			clear_unsync_child_bit(sp, i);
			continue;
		}
2010-06-11 21:34:04 +08:00
2022-10-19 16:56:16 +00:00
		child = spte_to_child_sp(ent);
2010-06-11 21:34:04 +08:00
		if (child->unsync_children) {
			if (mmu_pages_add(pvec, child, i))
				return -ENOSPC;
			ret = __mmu_unsync_walk(child, pvec);
2015-11-20 17:43:13 +09:00
			if (!ret) {
				clear_unsync_child_bit(sp, i);
				continue;
			} else if (ret > 0) {
2010-06-11 21:34:04 +08:00
				nr_unsync_leaf += ret;
2015-11-20 17:43:13 +09:00
			} else
2010-06-11 21:34:04 +08:00
				return ret;
		} else if (child->unsync) {
			nr_unsync_leaf++;
			if (mmu_pages_add(pvec, child, i))
				return -ENOSPC;
		} else
2015-11-20 17:43:13 +09:00
			clear_unsync_child_bit(sp, i);
2008-09-23 13:18:39 -03:00
	}
2008-12-01 22:32:02 -02:00
	return nr_unsync_leaf;
}
2016-02-24 09:46:06 +01:00
#define INVALID_INDEX (-1)
2008-12-01 22:32:02 -02:00
static int mmu_unsync_walk(struct kvm_mmu_page *sp,
			   struct kvm_mmu_pages *pvec)
{
2016-02-23 13:54:25 +01:00
	pvec->nr = 0;
2008-12-01 22:32:02 -02:00
	if (!sp->unsync_children)
		return 0;
2016-02-24 09:46:06 +01:00
	mmu_pages_add(pvec, sp, INVALID_INDEX);
2008-12-01 22:32:02 -02:00
	return __mmu_unsync_walk(sp, pvec);
2008-09-23 13:18:39 -03:00
}
static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(!sp->unsync);
2010-04-28 11:55:06 +08:00
	trace_kvm_mmu_sync_page(sp);
2008-09-23 13:18:39 -03:00
	sp->unsync = 0;
	--kvm->stat.mmu_unsync;
}
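The once-per-call-site behaviour that the WARN_ON_ONCE() commit message argues for can be illustrated in plain userspace C. This is only a sketch, assuming GNU C statement expressions (GCC/Clang); SKETCH_WARN_ON_ONCE, sketch_warn_count and sketch_check_unsync are hypothetical names, and this is not how the kernel implements its macro.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so callers can observe how many splats were emitted. */
static int sketch_warn_count;

/*
 * Evaluates to the truth value of @cond on every call, but reports only the
 * first failure seen at a given call site. Relies on GNU C statement
 * expressions; the static flag is private to each expansion of the macro.
 */
#define SKETCH_WARN_ON_ONCE(cond) ({					\
	static bool warned_;						\
	bool ret_ = !!(cond);						\
	if (ret_ && !warned_) {						\
		warned_ = true;						\
		sketch_warn_count++;					\
		fprintf(stderr, "WARN at %s:%d\n", __FILE__, __LINE__);	\
	}								\
	ret_;								\
})

/* Same shape as the assertion in kvm_unlink_unsync_page(): warn, carry on. */
static bool sketch_check_unsync(bool unsync)
{
	return SKETCH_WARN_ON_ONCE(!unsync);
}
```

Calling sketch_check_unsync(false) repeatedly still returns true each time, so callers can keep bailing out on the bad condition, while only the first call prints anything.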
KVM: x86/mmu: Differentiate between nr zapped and list unstable
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information. 1) was at
least one page zapped and 2) has the list of MMU pages become unstable.
In its original incarnation (as kvm_mmu_zap_page()), there was no
return value at all. Commit 0738541396be ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children. Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.
Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped. This was not intentional, it was simply a
side effect of how the code was written.
The unintended side effect was then morphed into an actual feature by
commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.
Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect. Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.
Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter. Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.
Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 13:01:35 -08:00
static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
				     struct list_head *invalid_list);
2010-06-04 21:53:54 +08:00
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
				    struct list_head *invalid_list);
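The split described in the commit message above (a boolean "list may be unstable" return value, with the zap count moved to a dedicated parameter and a thin wrapper for callers that ignore it) might look roughly like the following. It is a sketch under stated assumptions: struct sketch_zap_page and both function names are hypothetical, and the "children" counter stands in for the real recursive zap.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a shadow page: only what the sketch needs. */
struct sketch_zap_page {
	int nr_unsync_children;	/* children that get zapped along with it */
};

/*
 * Returns true if zapping touched more than the current list entry (i.e.
 * children were zapped, so a list walk must restart), and reports the total
 * number of pages zapped, including the page itself, through @nr_zapped.
 */
static bool sketch_prepare_zap(struct sketch_zap_page *sp, int *nr_zapped)
{
	bool list_unstable = sp->nr_unsync_children > 0;

	*nr_zapped = sp->nr_unsync_children + 1;
	sp->nr_unsync_children = 0;
	return list_unstable;
}

/* Wrapper for the majority of callers, which only care about stability. */
static bool sketch_prepare_zap_simple(struct sketch_zap_page *sp)
{
	int nr_zapped;

	return sketch_prepare_zap(sp, &nr_zapped);
}
```

Zapping a page with no children then reports one page zapped without forcing a restart, which is the case the overloaded return value got wrong.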
2008-09-23 13:18:39 -03:00
2022-04-20 21:12:03 +08:00
static bool sp_has_gptes(struct kvm_mmu_page *sp)
{
	if (sp->role.direct)
		return false;
2022-04-20 21:12:04 +08:00
	if (sp->role.passthrough)
		return false;
2022-04-20 21:12:03 +08:00
	return true;
}
2020-06-23 12:40:26 -07:00
#define for_each_valid_sp(_kvm, _sp, _list)				\
	hlist_for_each_entry(_sp, _list, hash_link)			\
2019-09-12 19:46:03 -07:00
		if (is_obsolete_sp((_kvm), (_sp))) {			\
2016-12-20 15:25:57 -08:00
		} else
2013-03-06 16:05:07 +09:00
2022-04-20 21:12:03 +08:00
#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
2020-06-23 12:40:26 -07:00
	for_each_valid_sp(_kvm, _sp,					\
	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
2022-04-20 21:12:03 +08:00
		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
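The two iterator macros above rely on the "if (skip) { } else" idiom to filter entries without capturing a caller's else branch. A self-contained sketch of the same idiom over a plain array follows; struct sketch_item and the sketch_* macros are hypothetical and only mirror the shape of the macros above, not the hash-list walk itself.

```c
#include <stddef.h>

struct sketch_item {
	int key;
	int obsolete;
};

/*
 * Visit every non-obsolete element of @_arr. The empty "if" branch skips
 * an entry and the trailing "else" binds the caller's loop body to the
 * filter, so the caller's statement is the else-branch of the macro.
 */
#define sketch_for_each_valid(_it, _arr, _n, _i)			\
	for ((_i) = 0;							\
	     (_i) < (_n) && ((_it) = &(_arr)[(_i)], 1);			\
	     (_i)++)							\
		if ((_it)->obsolete) {					\
		} else

/* Stack a second filter on top of the first, as the gfn variant does. */
#define sketch_for_each_valid_key(_it, _arr, _n, _i, _key)		\
	sketch_for_each_valid(_it, _arr, _n, _i)			\
		if ((_it)->key != (_key)) {				\
		} else

/* Count the valid entries whose key matches @key. */
static int sketch_count_key(struct sketch_item *arr, size_t n, int key)
{
	struct sketch_item *it;
	size_t i;
	int count = 0;

	sketch_for_each_valid_key(it, arr, n, i, key)
		count++;
	return count;
}
```

Because each filter ends in a dangling else, the filters compose by simple concatenation and the caller still writes an ordinary loop body.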
2010-06-04 21:53:07 +08:00
2023-02-16 23:41:08 +08:00
static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
	union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
	/*
	 * Ignore various flags when verifying that it's safe to sync a shadow
	 * page using the current MMU context.
	 *
	 *  - level: not part of the overall MMU role and will never match as the MMU's
	 *           level tracks the root level
	 *  - access: updated based on the new guest PTE
	 *  - quadrant: not part of the overall MMU role (similar to level)
	 */
	const union kvm_mmu_page_role sync_role_ign = {
		.level = 0xf,
		.access = 0x7,
		.quadrant = 0x3,
		.passthrough = 0x1,
	};
	/*
	 * Direct pages can never be unsync, and KVM should never attempt to
	 * sync a shadow page for a different MMU context, e.g. if the role
	 * differs then the memslot lookup (SMM vs. non-SMM) will be bogus, the
	 * reserved bits checks will be wrong, etc...
	 */
2023-02-16 23:41:11 +08:00
	if (WARN_ON_ONCE(sp->role.direct || !vcpu->arch.mmu->sync_spte ||
2023-02-16 23:41:08 +08:00
			 (sp->role.word ^ root_role.word) & ~sync_role_ign.word))
		return false;
	return true;
}
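The role comparison in kvm_sync_page_check() XORs two packed role words and masks out the fields that are allowed to differ. A reduced sketch of that technique is shown below; union sketch_role, its field widths and sketch_roles_compatible() are hypothetical, and the real kvm_mmu_page_role has many more fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, much-reduced role; the field widths are illustrative only. */
union sketch_role {
	uint32_t word;
	struct {
		uint32_t level:4;
		uint32_t access:3;
		uint32_t quadrant:2;
		uint32_t direct:1;
		uint32_t smm:1;
	};
};

/* True if @a and @b agree on every field that is not set in the mask. */
static bool sketch_roles_compatible(union sketch_role a, union sketch_role b)
{
	/* All-ones in a field means "ignore this field when comparing". */
	const union sketch_role ignore = {
		.level = 0xf,
		.access = 0x7,
		.quadrant = 0x3,
	};

	return !((a.word ^ b.word) & ~ignore.word);
}
```

Since both operands and the mask share one union layout, the comparison does not depend on how the compiler orders the bit-fields within the word.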
2023-02-17 07:53:21 +08:00
static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
	if (!sp->spt[i])
		return 0;
	return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
}
2023-02-16 23:41:08 +08:00
static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
2023-02-16 23:41:11 +08:00
	int flush = 0;
	int i;
2023-02-16 23:41:08 +08:00
	if (!kvm_sync_page_check(vcpu, sp))
		return -1;
2023-02-16 23:41:11 +08:00
	for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
2023-02-17 07:53:21 +08:00
		int ret = kvm_sync_spte(vcpu, sp, i);
2023-02-16 23:41:11 +08:00
		if (ret < -1)
			return -1;
		flush |= ret;
	}
	/*
	 * Note, any flush is purely for KVM's correctness, e.g. when dropping
	 * an existing SPTE or clearing W/A/D bits to ensure an mmu_notifier
	 * unmap or dirty logging event doesn't fail to flush.  The guest is
	 * responsible for flushing the TLB to ensure any changes in protection
	 * bits are recognized, i.e. until the guest flushes or page faults on
	 * a relevant address, KVM is architecturally allowed to let vCPUs use
	 * cached translations with the old protection bits.
	 */
	return flush;
2023-02-16 23:41:08 +08:00
}
2022-03-15 17:35:13 +08:00
static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
2021-06-22 10:56:57 -07:00
			 struct list_head *invalid_list)
2008-09-23 13:18:39 -03:00
{
2023-02-16 23:41:08 +08:00
	int ret = __kvm_sync_page(vcpu, sp);
2021-09-18 08:56:32 +08:00
2022-03-15 17:35:13 +08:00
	if (ret < 0)
2010-06-04 21:55:29 +08:00
		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
2022-03-15 17:35:13 +08:00
	return ret;
2008-09-23 13:18:39 -03:00
}
2019-02-05 13:01:20 -08:00
static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
					struct list_head *invalid_list,
					bool remote_flush)
{
2019-04-12 19:55:41 -07:00
	if (!remote_flush && list_empty(invalid_list))
2019-02-05 13:01:20 -08:00
		return false;
	if (!list_empty(invalid_list))
		kvm_mmu_commit_zap_page(kvm, invalid_list);
	else
		kvm_flush_remote_tlbs(kvm);
	return true;
}
2019-09-12 19:46:02 -07:00
static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acquiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.
For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.
For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointless (the root will be unloaded), apply
the check to all MMUs.
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-20 04:50:22 +00:00
	if (sp->role.invalid)
		return true;
2022-09-13 17:17:25 +08:00
	/* TDP MMU pages do not use the MMU generation. */
2022-10-12 18:17:00 +00:00
	return !is_tdp_mmu_page(sp) &&
2019-09-12 19:46:03 -07:00
	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
2019-09-12 19:46:02 -07:00
}
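The obsolescence test in is_obsolete_sp() combines an explicit invalid flag with a generation comparison. The sketch below shows the general generation-number technique in isolation. The struct and function names are hypothetical, the uses_generation flag merely models the TDP MMU exemption, and bumping a counter is an assumption made for illustration rather than a statement about how KVM updates mmu_valid_gen.

```c
#include <stdbool.h>

struct sketch_vm {
	unsigned long valid_gen;	/* changed to invalidate all pages at once */
};

struct sketch_gen_page {
	bool invalid;			/* explicitly marked for teardown */
	bool uses_generation;		/* false models a TDP MMU page */
	unsigned long gen;		/* VM generation when the page was created */
};

static bool sketch_is_obsolete(const struct sketch_vm *vm,
			       const struct sketch_gen_page *sp)
{
	if (sp->invalid)
		return true;

	/* Pages that opt out of the generation are only ever explicitly invalid. */
	return sp->uses_generation && sp->gen != vm->valid_gen;
}

/* Obsoleting every generation-tracked page is O(1): change the VM's value. */
static void sketch_invalidate_all(struct sketch_vm *vm)
{
	vm->valid_gen++;
}
```

The point of the design is that no page needs to be visited at invalidation time; stale pages are recognised lazily when an iterator such as for_each_valid_sp() next encounters them.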
2008-12-01 22:32:02 -02:00
struct mmu_page_path {
2017-08-24 20:27:54 +08:00
	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
	unsigned int idx[PT64_ROOT_MAX_LEVEL];
2008-09-23 13:18:39 -03:00
};
2008-12-01 22:32:02 -02:00
#define for_each_sp(pvec, sp, parents, i)			\
2016-02-23 13:54:25 +01:00
		for (i = mmu_pages_first(&pvec, &parents);	\
2008-12-01 22:32:02 -02:00
			i < pvec.nr && ({ sp = pvec.page[i].sp; 1; });	\
			i = mmu_pages_next(&pvec, &parents, i))
2009-02-21 02:19:13 +01:00
static int mmu_pages_next(struct kvm_mmu_pages *pvec,
			  struct mmu_page_path *parents,
			  int i)
2008-12-01 22:32:02 -02:00
{
	int n;
	for (n = i + 1; n < pvec->nr; n++) {
		struct kvm_mmu_page *sp = pvec->page[n].sp;
2016-02-23 13:54:25 +01:00
		unsigned idx = pvec->page[n].idx;
		int level = sp->role.level;
2008-12-01 22:32:02 -02:00
2016-02-23 13:54:25 +01:00
		parents->idx[level - 1] = idx;
2020-04-27 17:54:22 -07:00
		if (level == PG_LEVEL_4K)
2016-02-23 13:54:25 +01:00
			break;
2008-12-01 22:32:02 -02:00
2016-02-23 13:54:25 +01:00
		parents->parent[level - 2] = sp;
2008-12-01 22:32:02 -02:00
	}
	return n;
}
2016-02-23 13:54:25 +01:00
static int mmu_pages_first(struct kvm_mmu_pages *pvec,
			   struct mmu_page_path *parents)
{
	struct kvm_mmu_page *sp;
	int level;
	if (pvec->nr == 0)
		return 0;
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(pvec->page[0].idx != INVALID_INDEX);
2016-02-24 09:46:06 +01:00
2016-02-23 13:54:25 +01:00
	sp = pvec->page[0].sp;
	level = sp->role.level;
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(level == PG_LEVEL_4K);
2016-02-23 13:54:25 +01:00
	parents->parent[level - 2] = sp;
	/* Also set up a sentinel.  Further entries in pvec are all
	 * children of sp, so this element is never overwritten.
	 */
	parents->parent[level - 1] = NULL;
	return mmu_pages_next(pvec, parents, 0);
}
2009-02-21 02:19:13 +01:00
static void mmu_pages_clear_parents(struct mmu_page_path *parents)
2008-09-23 13:18:39 -03:00
{
2008-12-01 22:32:02 -02:00
	struct kvm_mmu_page *sp;
	unsigned int level = 0;
	do {
		unsigned int idx = parents->idx[level];
		sp = parents->parent[level];
		if (!sp)
			return;
2023-07-28 17:47:17 -07:00
		WARN_ON_ONCE(idx == INVALID_INDEX);
2015-11-20 17:43:13 +09:00
		clear_unsync_child_bit(sp, idx);
2008-12-01 22:32:02 -02:00
		level++;
2016-02-23 13:54:25 +01:00
	} while (!sp->unsync_children);
2008-12-01 22:32:02 -02:00
}
2008-09-23 13:18:39 -03:00
2021-09-18 08:56:28 +08:00
static int mmu_sync_children(struct kvm_vcpu *vcpu,
			     struct kvm_mmu_page *parent, bool can_yield)
2008-12-01 22:32:02 -02:00
{
	int i;
	struct kvm_mmu_page *sp;
	struct mmu_page_path parents;
	struct kvm_mmu_pages pages;
2010-06-04 21:55:29 +08:00
	LIST_HEAD(invalid_list);
2016-02-25 10:47:38 +01:00
	bool flush = false;
2008-12-01 22:32:02 -02:00
	while (mmu_unsync_walk(parent, &pages)) {
2012-06-20 15:56:53 +08:00
		bool protected = false;
2008-12-01 22:32:03 -02:00
		for_each_sp(pages, sp, parents, i)
2022-01-19 23:07:22 +00:00
			protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
2008-12-01 22:32:03 -02:00
2016-02-25 10:47:38 +01:00
		if (protected) {
2021-09-18 08:56:33 +08:00
			kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
2016-02-25 10:47:38 +01:00
			flush = false;
		}
2008-12-01 22:32:03 -02:00
2008-12-01 22:32:02 -02:00
		for_each_sp(pages, sp, parents, i) {
2021-06-22 10:56:57 -07:00
			kvm_unlink_unsync_page(vcpu->kvm, sp);
2022-03-15 17:35:13 +08:00
			flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0;
2008-12-01 22:32:02 -02:00
			mmu_pages_clear_parents(&parents);
		}
2021-02-02 10:57:24 -08:00
		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
2021-09-18 08:56:32 +08:00
			kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
2021-09-18 08:56:28 +08:00
			if (!can_yield) {
				kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
				return -EINTR;
			}
2021-02-02 10:57:24 -08:00
			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
2016-02-25 10:47:38 +01:00
			flush = false;
		}
2008-12-01 22:32:02 -02:00
	}
2016-02-25 10:47:38 +01:00
2021-09-18 08:56:32 +08:00
	kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
2021-09-18 08:56:28 +08:00
	return 0;
2008-09-23 13:18:39 -03:00
}
KVM: MMU: improve write flooding detected
Write-flooding detection does not work well. When handling a write to a
page, the page is treated as write-flooded if the last speculative spte has
not been accessed. However, sptes can be created speculatively on many
paths, such as pte prefetch and page sync, so the last speculative spte may
not point to the written page, and the written page can be accessed via
other sptes. Depending on the Accessed bit of the last speculative spte is
therefore not enough.
Instead of detecting whether the page is accessed, detect whether the spte
is accessed after it is written. If the spte is written frequently but not
accessed, treat the page as not being a page table, or as not having been
used for a long time.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-22 16:58:36 +08:00
static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
{
2016-02-24 17:51:12 +08:00
	atomic_set(&sp->write_flooding_count, 0);
2011-09-22 16:58:36 +08:00
}
static void clear_sp_write_flooding_count(u64 *spte)
{
2020-06-22 13:20:33 -07:00
	__clear_sp_write_flooding_count(sptep_to_sp(spte));
2011-09-22 16:58:36 +08:00
}
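The write-flooding scheme described in the commit message (count writes, reset the count whenever the page is actually used) can be sketched as a simple counter. Everything here is hypothetical: the struct, the function names and in particular the threshold value are assumptions for illustration, and the sketch uses a plain int where the code above uses an atomic_t.

```c
#include <stdbool.h>

/* Assumed threshold for illustration; not taken from the code above. */
#define SKETCH_FLOOD_THRESHOLD 3

struct sketch_flood_page {
	int write_flooding_count;
};

/* Called when the guest actually uses the page as a page table. */
static void sketch_clear_flooding(struct sketch_flood_page *sp)
{
	sp->write_flooding_count = 0;
}

/*
 * Called on every emulated write to the page. Returns true once the page
 * has been written SKETCH_FLOOD_THRESHOLD times with no use in between.
 */
static bool sketch_detect_flooding(struct sketch_flood_page *sp)
{
	return ++sp->write_flooding_count >= SKETCH_FLOOD_THRESHOLD;
}
```

A page that is written and then used keeps resetting its counter and is never flagged, whereas a page that is only ever written crosses the threshold quickly.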
2022-06-22 15:27:00 -04:00
/*
 * The vCPU is required when finding indirect shadow pages; the shadow
 * page may already exist and syncing it needs the vCPU pointer in
 * order to read guest page tables.  Direct shadow pages are never
 * unsync, thus @vcpu can be NULL if @role.direct is true.
 */
2022-06-22 15:26:59 -04:00
static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
						     struct kvm_vcpu *vcpu,
2022-06-22 15:26:53 -04:00
						     gfn_t gfn,
						     struct hlist_head *sp_list,
						     union kvm_mmu_page_role role)
[PATCH] KVM: MMU: Shadow page table caching
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.
The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
* we can cache real mode, 32-bit mode, pae, and long mode page
tables simultaneously. this is useful for smp bootup.
- the guest page table level
* some kernels use a page as both a page table and a page directory. this
allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
* 32-bit mode page tables span 4MB, whereas a shadow page table spans
2MB. similarly, a 32-bit page directory spans 4GB, while a shadow
page directory spans 1GB. the quadrant allows caching up to 4 shadow page
tables for one guest page in one level.
- a "metaphysical" bit
* for real mode, and for pse pages, there is no guest page table, so set
the bit to avoid write protecting the page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 16:36:43 -08:00
{
2010-05-24 15:41:33 +08:00
	struct kvm_mmu_page *sp;
2022-03-15 17:35:13 +08:00
	int ret;
2016-12-20 15:25:57 -08:00
	int collisions = 0;
2016-02-24 11:26:10 +01:00
	LIST_HEAD(invalid_list);
2007-01-05 16:36:43 -08:00
2022-06-22 15:26:59 -04:00
	for_each_valid_sp(kvm, sp, sp_list) {
2016-12-20 15:25:57 -08:00
		if (sp->gfn != gfn) {
			collisions++;
			continue;
		}
KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFN
When creating a new upper-level shadow page, zap unsync shadow pages at
the same target gfn instead of attempting to sync the pages. This fixes
a bug where an unsync shadow page could be sync'd with an incompatible
context, e.g. wrong smm, is_guest, etc... flags. In practice, the bug is
relatively benign as sync_page() is all but guaranteed to fail its check
that the guest's desired gfn (for the to-be-sync'd page) matches the
current gfn associated with the shadow page. I.e. kvm_sync_page() would
end up zapping the page anyways.
Alternatively, __kvm_sync_page() could be modified to explicitly verify
the mmu_role of the unsync shadow page is compatible with the current MMU
context. But, except for this specific case, __kvm_sync_page() is called
iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
requires an exact role match, and the call from kvm_sync_mmu_roots() is
only synchronizing shadow pages from the current MMU (which better be
compatible or KVM has problems). And as described above, attempting to
sync shadow pages when creating an upper-level shadow page is unlikely
to succeed, e.g. zero successful syncs were observed when running Linux
guests despite over a million attempts.
Fixes: 9f1a122f970d ("KVM: MMU: allow more page become unsync at getting sp time")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-10-seanjc@google.com>
[Remove WARN_ON after __kvm_sync_page. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-22 10:56:54 -07:00
		if (sp->role.word != role.word) {
			/*
			 * If the guest is creating an upper-level page, zap
			 * unsync pages for the same gfn.  While it's possible
			 * the guest is using recursive page tables, in all
			 * likelihood the guest has stopped using the unsync
			 * page and is installing a completely unrelated page.
			 * Unsync pages must not be left as is, because the new
			 * upper-level page will be write-protected.
			 */
2022-06-22 15:26:51 -04:00
			if (role.level > PG_LEVEL_4K && sp->unsync)
2022-06-22 15:26:59 -04:00
				kvm_mmu_prepare_zap_page(kvm, sp,
2021-06-22 10:56:54 -07:00
							 &invalid_list);
2010-06-04 21:53:07 +08:00
			continue;
2021-06-22 10:56:54 -07:00
		}
2008-09-23 13:18:39 -03:00
2022-06-22 15:26:48 -04:00
		/* unsync and write-flooding only apply to indirect SPs. */
		if (sp->role.direct)
2022-06-22 15:26:53 -04:00
			goto out;
2020-06-23 12:40:27 -07:00
2016-02-24 11:26:10 +01:00
		if (sp->unsync) {
2022-06-22 15:27:00 -04:00
			if (KVM_BUG_ON(!vcpu, kvm))
				break;
2021-06-23 12:49:19 -04:00
			/*
2021-06-22 10:56:57 -07:00
			 * The page is good, but is stale.  kvm_sync_page does
2021-06-23 12:49:19 -04:00
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
2016-02-24 11:26:10 +01:00
			 */
2022-03-15 17:35:13 +08:00
			ret = kvm_sync_page(vcpu, sp, &invalid_list);
			if (ret < 0)
2016-02-24 11:26:10 +01:00
				break;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( ! list_empty ( & invalid_list ) ) ;
2022-03-15 17:35:13 +08:00
if ( ret > 0 )
2022-06-22 15:26:59 -04:00
kvm_flush_remote_tlbs ( kvm ) ;
2016-02-24 11:26:10 +01:00
}
KVM: MMU: don't write-protect if have new mapping to unsync page
Two cases can occur in kvm_mmu_get_page():
- The target sp is already in the cache. If the sp is unsync, we only
need to update it to ensure the mapping is valid; we neither mark it
sync nor write-protect sp->gfn, since this does not break the unsync
rule (one shadow page per gfn).
- The target sp does not exist and we need to create a new sp for the
gfn, i.e. the gfn may already have another shadow page. To keep the
unsync rule, we should sync (mark sync and write-protect) the gfn's
unsync shadow page. After multiple unsync shadows are enabled, we sync
those shadow pages only when the new sp is not allowed to become unsync
(the new unsync rule is: allow all pte pages to become unsync).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-15 18:52:34 +08:00
KVM: MMU: improve write flooding detected
Write-flooding detection does not work well. When handling a write to a
page, if the last speculative spte has not been accessed, we treat the page
as write-flooded. However, sptes can be created speculatively on many paths,
such as pte prefetch and page sync, so the last speculative spte may not
point to the written page, and the written page can be accessed via other
sptes. Depending on the Accessed bit of the last speculative spte is
therefore not enough.
Instead of detecting whether the page is accessed, we can detect whether the
spte is accessed after it is written. If the spte is not accessed but is
written frequently, we treat the page as not being a page table, or as not
having been used for a long time.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-22 16:58:36 +08:00
__clear_sp_write_flooding_count ( sp ) ;
2020-06-23 12:40:27 -07:00
2016-12-20 15:25:57 -08:00
goto out ;
2010-06-04 21:53:07 +08:00
}
2015-11-20 17:46:29 +09:00
2022-06-22 15:26:53 -04:00
sp = NULL ;
2022-06-22 15:26:59 -04:00
+ + kvm - > stat . mmu_cache_miss ;
2015-11-20 17:46:29 +09:00
2022-06-22 15:26:53 -04:00
out :
2022-06-22 15:26:59 -04:00
kvm_mmu_commit_zap_page ( kvm , & invalid_list ) ;
2022-06-22 15:26:53 -04:00
2022-06-22 15:26:59 -04:00
if ( collisions > kvm - > stat . max_mmu_page_hash_collisions )
kvm - > stat . max_mmu_page_hash_collisions = collisions ;
2022-06-22 15:26:53 -04:00
return sp ;
}
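The WARN_ON() to WARN_ON_ONCE() rationale quoted above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel macros: struct demo_warn_site, demo_warn_on() and demo_warn_on_once() are hypothetical names, and the per-site state is passed explicitly instead of living in a static variable hidden inside a macro. Both helpers return the condition so callers can still branch on it; only the number of emitted splats differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-call-site state; the kernel hides this inside its macros. */
struct demo_warn_site {
    bool warned;          /* set after the first splat */
    unsigned long hits;   /* how often the condition was true */
    unsigned long splats; /* how many reports were actually emitted */
};

/* Reports every time the condition is true, like an unconditional WARN. */
static bool demo_warn_on(struct demo_warn_site *site, bool cond)
{
    if (cond) {
        site->hits++;
        site->splats++;
        fprintf(stderr, "WARNING: assertion failed (hit %lu)\n", site->hits);
    }
    return cond;
}

/* Reports only the first time the condition is true. */
static bool demo_warn_on_once(struct demo_warn_site *site, bool cond)
{
    if (cond) {
        site->hits++;
        if (!site->warned) {
            site->warned = true;
            site->splats++;
            fprintf(stderr, "WARNING: assertion failed (reported once)\n");
        }
    }
    return cond;
}
```

With a condition that fails on every page fault, the first helper logs once per fault while the second logs a single splat and then stays silent, which is the log-spam difference the commit message describes.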
2022-06-22 15:26:57 -04:00
/* Caches used when allocating a new shadow page. */
struct shadow_page_caches {
struct kvm_mmu_memory_cache * page_header_cache ;
struct kvm_mmu_memory_cache * shadow_page_cache ;
KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.
In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.
While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:04 -04:00
struct kvm_mmu_memory_cache * shadowed_info_cache ;
2022-06-22 15:26:57 -04:00
} ;
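The "3 access bits plus a gfn in one 64-bit word" idea from the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's layout: the helper names are hypothetical, and the sketch places the access bits in the low 3 bits and the gfn in the remaining 61 bits, which matches the arithmetic in the message but may differ from how shadowed_translation is actually encoded.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: bits 2:0 = access, bits 63:3 = gfn. */
#define DEMO_ACCESS_BITS 3
#define DEMO_ACCESS_MASK ((UINT64_C(1) << DEMO_ACCESS_BITS) - 1)

/* Pack a gfn and its access permissions into a single 64-bit entry. */
static uint64_t demo_pack_translation(uint64_t gfn, unsigned int access)
{
    assert(gfn <= (UINT64_MAX >> DEMO_ACCESS_BITS)); /* gfn must fit in 61 bits */
    assert(access <= DEMO_ACCESS_MASK);              /* access must fit in 3 bits */
    return (gfn << DEMO_ACCESS_BITS) | access;
}

/* Recover the gfn from a packed entry. */
static uint64_t demo_translation_gfn(uint64_t entry)
{
    return entry >> DEMO_ACCESS_BITS;
}

/* Recover the access bits from a packed entry. */
static unsigned int demo_translation_access(uint64_t entry)
{
    return (unsigned int)(entry & DEMO_ACCESS_MASK);
}
```

Because both fields live in one word, caching the access bits costs no extra memory per entry compared with an array that stores only gfns.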
2022-06-22 15:26:58 -04:00
static struct kvm_mmu_page * kvm_mmu_alloc_shadow_page ( struct kvm * kvm ,
2022-06-22 15:26:57 -04:00
struct shadow_page_caches * caches ,
2022-06-22 15:26:53 -04:00
gfn_t gfn ,
struct hlist_head * sp_list ,
union kvm_mmu_page_role role )
{
2022-06-22 15:26:54 -04:00
struct kvm_mmu_page * sp ;
2022-06-22 15:26:57 -04:00
sp = kvm_mmu_memory_cache_alloc ( caches - > page_header_cache ) ;
sp - > spt = kvm_mmu_memory_cache_alloc ( caches - > shadow_page_cache ) ;
2022-06-22 15:26:54 -04:00
if ( ! role . direct )
2022-06-22 15:27:04 -04:00
sp - > shadowed_translation = kvm_mmu_memory_cache_alloc ( caches - > shadowed_info_cache ) ;
2022-06-22 15:26:54 -04:00
set_page_private ( virt_to_page ( sp - > spt ) , ( unsigned long ) sp ) ;
2022-10-19 16:56:12 +00:00
INIT_LIST_HEAD ( & sp - > possible_nx_huge_page_link ) ;
2022-10-19 16:56:11 +00:00
2022-06-22 15:26:54 -04:00
/*
* active_mmu_pages must be a FIFO list , as kvm_zap_obsolete_pages ( )
* depends on valid pages being added to the head of the list . See
* comments in kvm_zap_obsolete_pages ( ) .
*/
2022-06-22 15:26:58 -04:00
sp - > mmu_valid_gen = kvm - > arch . mmu_valid_gen ;
list_add ( & sp - > link , & kvm - > arch . active_mmu_pages ) ;
2022-08-23 00:46:37 +00:00
kvm_account_mmu_page ( kvm , sp ) ;
2015-11-20 17:46:29 +09:00
2007-11-21 15:28:32 +02:00
sp - > gfn = gfn ;
sp - > role = role ;
2020-06-23 12:40:26 -07:00
hlist_add_head ( & sp - > hash_link , sp_list ) ;
2022-06-22 15:26:56 -04:00
if ( sp_has_gptes ( sp ) )
2022-06-22 15:26:58 -04:00
account_shadowed ( kvm , sp ) ;
2021-06-22 10:56:54 -07:00
2022-06-22 15:26:53 -04:00
return sp ;
}
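The comment in kvm_mmu_alloc_shadow_page() above says new pages must be added to the head of active_mmu_pages because kvm_zap_obsolete_pages() depends on that ordering. That function is not shown in this excerpt, so the following is only a sketch of why head insertion plus a generation number helps, under the assumption that obsolete pages are reclaimed by walking from the tail (oldest) and stopping at the first page of the current generation. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_page {
    unsigned long gen;              /* generation the page was created in */
    struct demo_page *prev, *next;
};

struct demo_vm {
    unsigned long valid_gen;        /* current generation */
    struct demo_page *head, *tail;  /* head = newest, tail = oldest */
};

/* New pages always go to the head and are stamped with the current generation. */
static void demo_add_page(struct demo_vm *vm, struct demo_page *p)
{
    p->gen = vm->valid_gen;
    p->prev = NULL;
    p->next = vm->head;
    if (vm->head)
        vm->head->prev = p;
    else
        vm->tail = p;
    vm->head = p;
}

/*
 * Unlink obsolete pages starting from the tail. Because the list is ordered
 * by creation time, the first current-generation page ends the walk: nothing
 * closer to the head can be older.
 */
static int demo_zap_obsolete(struct demo_vm *vm)
{
    int zapped = 0;

    while (vm->tail && vm->tail->gen != vm->valid_gen) {
        struct demo_page *p = vm->tail;

        vm->tail = p->prev;
        if (vm->tail)
            vm->tail->next = NULL;
        else
            vm->head = NULL;
        p->prev = p->next = NULL;
        zapped++;
    }
    return zapped;
}
```

If pages were inserted anywhere other than the head, an obsolete page could sit behind a valid one and the early exit would miss it.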
2022-06-22 15:27:00 -04:00
/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
2022-06-22 15:26:59 -04:00
static struct kvm_mmu_page * __kvm_mmu_get_shadow_page ( struct kvm * kvm ,
struct kvm_vcpu * vcpu ,
2022-06-22 15:26:57 -04:00
struct shadow_page_caches * caches ,
gfn_t gfn ,
union kvm_mmu_page_role role )
2022-06-22 15:26:53 -04:00
{
struct hlist_head * sp_list ;
struct kvm_mmu_page * sp ;
bool created = false ;
2022-06-22 15:26:59 -04:00
sp_list = & kvm - > arch . mmu_page_hash [ kvm_page_table_hashfn ( gfn ) ] ;
2022-06-22 15:26:53 -04:00
2022-06-22 15:26:59 -04:00
sp = kvm_mmu_find_shadow_page ( kvm , vcpu , gfn , sp_list , role ) ;
2022-06-22 15:26:53 -04:00
if ( ! sp ) {
created = true ;
2022-06-22 15:26:59 -04:00
sp = kvm_mmu_alloc_shadow_page ( kvm , caches , gfn , sp_list , role ) ;
2022-06-22 15:26:53 -04:00
}
trace_kvm_mmu_get_page ( sp , created ) ;
2007-11-21 15:28:32 +02:00
return sp ;
[PATCH] KVM: MMU: Shadow page table caching
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.
The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
* we can cache real mode, 32-bit mode, pae, and long mode page
tables simultaneously. this is useful for smp bootup.
- the guest page table level
* some kernels use a page as both a page table and a page directory. this
allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
* 32-bit mode page tables span 4MB, whereas a shadow page table spans
2MB. similarly, a 32-bit page directory spans 4GB, while a shadow
page directory spans 1GB. the quadrant allows caching up to 4 shadow page
tables for one guest page in one level.
- a "metaphysical" bit
* for real mode, and for pse pages, there is no guest page table, so set
the bit to avoid write protecting the page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 16:36:43 -08:00
}
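The find-or-allocate flow of __kvm_mmu_get_shadow_page() and the cache key described in the commit message above can be sketched in userspace as follows. This is a simplified model, not the kernel code: the role is reduced to three fields packed into one integer by a hypothetical demo_make_role(), the hash function is a plain modulo, and allocation uses calloc() instead of the per-vCPU memory caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_HASH_BUCKETS 64

struct demo_sp {
    uint64_t gfn;
    uint32_t role;          /* packed role word, compared as a whole */
    struct demo_sp *next;   /* hash chain */
};

struct demo_cache {
    struct demo_sp *buckets[DEMO_HASH_BUCKETS];
    unsigned long misses;   /* counts allocations, like mmu_cache_miss */
};

/* Hypothetical role encoding: bits 3:0 level, bits 5:4 quadrant, bit 6 direct. */
static uint32_t demo_make_role(unsigned int level, unsigned int quadrant, bool direct)
{
    return (level & 0xfu) | ((quadrant & 0x3u) << 4) | ((direct ? 1u : 0u) << 6);
}

/* Return the cached page for (gfn, role), creating it on a miss. */
static struct demo_sp *demo_get_page(struct demo_cache *cache, uint64_t gfn,
                                     uint32_t role, bool *created)
{
    struct demo_sp **head = &cache->buckets[gfn % DEMO_HASH_BUCKETS];
    struct demo_sp *sp;

    *created = false;
    for (sp = *head; sp; sp = sp->next) {
        if (sp->gfn == gfn && sp->role == role)
            return sp;              /* cache hit: same frame, same role */
    }

    sp = calloc(1, sizeof(*sp));
    if (!sp)
        return NULL;
    sp->gfn = gfn;
    sp->role = role;
    sp->next = *head;               /* add to the head of the hash chain */
    *head = sp;
    cache->misses++;
    *created = true;
    return sp;
}
```

Keying on the role as well as the gfn is what lets one guest frame have several shadow pages at once, for example one per level when a guest reuses a page as both a page table and a page directory.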
2022-06-22 15:26:57 -04:00
static struct kvm_mmu_page * kvm_mmu_get_shadow_page ( struct kvm_vcpu * vcpu ,
gfn_t gfn ,
union kvm_mmu_page_role role )
{
struct shadow_page_caches caches = {
. page_header_cache = & vcpu - > arch . mmu_page_header_cache ,
. shadow_page_cache = & vcpu - > arch . mmu_shadow_page_cache ,
2022-06-22 15:27:04 -04:00
. shadowed_info_cache = & vcpu - > arch . mmu_shadowed_info_cache ,
2022-06-22 15:26:57 -04:00
} ;
2022-06-22 15:26:59 -04:00
return __kvm_mmu_get_shadow_page ( vcpu - > kvm , vcpu , & caches , gfn , role ) ;
2022-06-22 15:26:57 -04:00
}
2022-07-12 02:07:23 +00:00
static union kvm_mmu_page_role kvm_mmu_child_role ( u64 * sptep , bool direct ,
unsigned int access )
2022-06-22 15:26:51 -04:00
{
struct kvm_mmu_page * parent_sp = sptep_to_sp ( sptep ) ;
union kvm_mmu_page_role role ;
role = parent_sp - > role ;
role . level - - ;
role . access = access ;
role . direct = direct ;
role . passthrough = 0 ;
/*
* If the guest has 4 - byte PTEs then that means it ' s using 32 - bit ,
* 2 - level , non - PAE paging . KVM shadows such guests with PAE paging
* ( i . e . 8 - byte PTEs ) . The difference in PTE size means that KVM must
* shadow each guest page table with multiple shadow page tables , which
* requires extra bookkeeping in the role .
*
* Specifically , to shadow the guest ' s page directory ( which covers a
* 4 GiB address space ) , KVM uses 4 PAE page directories , each mapping
* 1 GiB of the address space . @ role . quadrant encodes which quarter of
* the address space each maps .
*
* To shadow the guest ' s page tables ( which each map a 4 MiB region ) , KVM
* uses 2 PAE page tables , each mapping a 2 MiB region . For these ,
* @ role . quadrant encodes which half of the region they map .
*
2022-07-12 02:07:23 +00:00
* Concretely , a 4 - byte PDE consumes bits 31 : 22 , while an 8 - byte PDE
* consumes bits 29 : 21. To consume bits 31 : 30 , KVM uses 4 shadow
* PDPTEs ; those 4 PAE page directories are pre - allocated and their
* quadrant is assigned in mmu_alloc_root ( ) . A 4 - byte PTE consumes
* bits 21 : 12 , while an 8 - byte PTE consumes bits 20 : 12. To consume
* bit 21 in the PTE ( the child here ) , KVM propagates that bit to the
* quadrant , i . e . sets quadrant to ' 0 ' or ' 1 ' . The parent 8 - byte PDE
* covers bit 21 ( see above ) , thus the quadrant is calculated from the
* _least_ significant bit of the PDE index .
2022-06-22 15:26:51 -04:00
*/
if ( role . has_4_byte_gpte ) {
WARN_ON_ONCE ( role . level ! = PG_LEVEL_4K ) ;
2022-07-12 02:07:22 +00:00
role . quadrant = spte_index ( sptep ) & 1 ;
2022-06-22 15:26:51 -04:00
}
return role ;
}
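The bit arithmetic in the quadrant comment of kvm_mmu_child_role() above can be checked with a few helpers. This is a standalone sketch with hypothetical demo_* names; it only restates the bit ranges given in the comment (4-byte PDE: bits 31:22, 8-byte PDE: bits 29:21, 4-byte PTE: bits 21:12, 8-byte PTE: bits 20:12) for a 32-bit guest address.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the 4 PAE page directories: address bits 31:30. */
static unsigned int demo_pae_root_index(uint32_t addr)
{
    return (addr >> 30) & 0x3u;
}

/* Index of the 8-byte shadow PDE: address bits 29:21. */
static unsigned int demo_pae_pde_index(uint32_t addr)
{
    return (addr >> 21) & 0x1ffu;
}

/* Index of the 8-byte shadow PTE: address bits 20:12. */
static unsigned int demo_pae_pte_index(uint32_t addr)
{
    return (addr >> 12) & 0x1ffu;
}

/* Index of the 4-byte guest PDE: address bits 31:22. */
static unsigned int demo_guest_pde_index(uint32_t addr)
{
    return (addr >> 22) & 0x3ffu;
}

/* Index of the 4-byte guest PTE: address bits 21:12. */
static unsigned int demo_guest_pte_index(uint32_t addr)
{
    return (addr >> 12) & 0x3ffu;
}

/*
 * Quadrant of the child page table: the least significant bit of the shadow
 * PDE index, i.e. address bit 21, which selects one half of the 4 MiB region
 * mapped by a single guest page table.
 */
static unsigned int demo_child_quadrant(uint32_t addr)
{
    return demo_pae_pde_index(addr) & 1u;
}
```

Two identities follow directly: the guest PTE index equals quadrant * 512 plus the shadow PTE index, and the guest PDE index equals the root index * 256 plus the shadow PDE index shifted right by one.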
static struct kvm_mmu_page * kvm_mmu_get_child_sp ( struct kvm_vcpu * vcpu ,
u64 * sptep , gfn_t gfn ,
bool direct , unsigned int access )
{
union kvm_mmu_page_role role ;
KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE. This is done by drop_large_spte().
However, dropping the large SPTE is really only necessary before the
sp is installed. While the sp is returned by kvm_mmu_get_child_sp(),
installing it happens later in __link_shadow_page(). Move the call
there instead of having it in each and every caller.
To ensure that the shadow page is not linked twice if it was present,
do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
instead, return an error value if the shadow page already existed.
This is a bit more verbose, but clearer than NULL.
Finally, now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:07 -04:00
if ( is_shadow_present_pte ( * sptep ) & & ! is_large_pte ( * sptep ) )
return ERR_PTR ( - EEXIST ) ;
2022-06-22 15:26:51 -04:00
role = kvm_mmu_child_role ( sptep , direct , access ) ;
2022-06-22 15:26:55 -04:00
return kvm_mmu_get_shadow_page ( vcpu , gfn , role ) ;
2022-06-22 15:26:51 -04:00
}
2018-06-27 14:59:16 -07:00
static void shadow_walk_init_using_root ( struct kvm_shadow_walk_iterator * iterator ,
struct kvm_vcpu * vcpu , hpa_t root ,
u64 addr )
2008-12-25 14:39:47 +02:00
{
iterator - > addr = addr ;
2018-06-27 14:59:16 -07:00
iterator - > shadow_addr = root ;
2022-02-10 07:41:19 -05:00
iterator - > level = vcpu - > arch . mmu - > root_role . level ;
2010-09-10 17:31:00 +02:00
2021-11-24 20:20:43 +08:00
if ( iterator - > level > = PT64_ROOT_4LEVEL & &
2022-02-10 07:42:22 -05:00
vcpu - > arch . mmu - > cpu_role . base . level < PT64_ROOT_4LEVEL & &
2022-02-10 08:00:56 -05:00
! vcpu - > arch . mmu - > root_role . direct )
2021-11-24 20:20:43 +08:00
iterator - > level = PT32E_ROOT_LEVEL ;
2010-09-10 17:31:00 +02:00
2008-12-25 14:39:47 +02:00
if ( iterator - > level = = PT32E_ROOT_LEVEL ) {
2018-06-27 14:59:16 -07:00
/*
* prev_root is currently only used for 64 - bit hosts . So only
* the active root_hpa is valid here .
*/
2022-02-21 09:28:33 -05:00
BUG_ON ( root ! = vcpu - > arch . mmu - > root . hpa ) ;
2018-06-27 14:59:16 -07:00
2008-12-25 14:39:47 +02:00
iterator - > shadow_addr
2018-10-08 21:28:05 +02:00
= vcpu - > arch . mmu - > pae_root [ ( addr > > 30 ) & 3 ] ;
2022-06-14 23:33:25 +00:00
iterator - > shadow_addr & = SPTE_BASE_ADDR_MASK ;
2008-12-25 14:39:47 +02:00
- - iterator - > level ;
if ( ! iterator - > shadow_addr )
iterator - > level = 0 ;
}
}
2018-06-27 14:59:16 -07:00
static void shadow_walk_init ( struct kvm_shadow_walk_iterator * iterator ,
struct kvm_vcpu * vcpu , u64 addr )
{
2022-02-21 09:28:33 -05:00
shadow_walk_init_using_root ( iterator , vcpu , vcpu - > arch . mmu - > root . hpa ,
2018-06-27 14:59:16 -07:00
addr ) ;
}
2008-12-25 14:39:47 +02:00
static bool shadow_walk_okay ( struct kvm_shadow_walk_iterator * iterator )
{
2020-04-27 17:54:22 -07:00
if ( iterator - > level < PG_LEVEL_4K )
2008-12-25 14:39:47 +02:00
return false ;
2009-06-11 12:07:41 -03:00
2022-06-14 23:33:25 +00:00
iterator - > index = SPTE_INDEX ( iterator - > addr , iterator - > level ) ;
2008-12-25 14:39:47 +02:00
iterator - > sptep = ( ( u64 * ) __va ( iterator - > shadow_addr ) ) + iterator - > index ;
return true ;
}
2011-07-12 03:32:13 +08:00
static void __shadow_walk_next ( struct kvm_shadow_walk_iterator * iterator ,
u64 spte )
2008-12-25 14:39:47 +02:00
{
2021-09-06 20:25:47 +08:00
if ( ! is_shadow_present_pte ( spte ) | | is_last_spte ( spte , iterator - > level ) ) {
2011-07-12 03:21:17 +08:00
iterator - > level = 0 ;
return ;
}
2022-06-14 23:33:25 +00:00
iterator - > shadow_addr = spte & SPTE_BASE_ADDR_MASK ;
2008-12-25 14:39:47 +02:00
- - iterator - > level ;
}
2011-07-12 03:32:13 +08:00
static void shadow_walk_next ( struct kvm_shadow_walk_iterator * iterator )
{
2017-08-24 20:51:23 +02:00
__shadow_walk_next ( iterator , * iterator - > sptep ) ;
2011-07-12 03:32:13 +08:00
}
2022-06-22 15:27:07 -04:00
static void __link_shadow_page ( struct kvm * kvm ,
struct kvm_mmu_memory_cache * cache , u64 * sptep ,
2022-06-22 15:27:10 -04:00
struct kvm_mmu_page * sp , bool flush )
2020-09-25 14:22:48 -07:00
{
u64 spte ;
BUILD_BUG_ON ( VMX_EPT_WRITABLE_MASK ! = PT_WRITABLE_MASK ) ;
2022-06-22 15:27:07 -04:00
/*
* If an SPTE is present already , it must be a leaf and therefore
2022-06-22 15:27:10 -04:00
* a large one . Drop it , and flush the TLB if needed , before
* installing sp .
2022-06-22 15:27:07 -04:00
*/
if ( is_shadow_present_pte ( * sptep ) )
2022-06-22 15:27:10 -04:00
drop_large_spte ( kvm , sptep , flush ) ;
2022-06-22 15:27:07 -04:00
2020-09-25 14:22:48 -07:00
spte = make_nonleaf_spte ( sp - > spt , sp_ad_disabled ( sp ) ) ;
2011-07-12 03:30:35 +08:00
mmu_spte_set ( sptep , spte ) ;
2015-11-26 21:14:34 +09:00
2022-06-22 15:27:02 -04:00
mmu_page_add_parent_pte ( cache , sp , sptep ) ;
2015-11-26 21:14:34 +09:00
2022-12-12 17:01:06 +08:00
/*
* The non - direct sub - pagetable must be updated before linking . For
* L1 sp , the pagetable is updated via kvm_sync_page ( ) in
* kvm_mmu_find_shadow_page ( ) without write - protecting the gfn ,
* so sp - > unsync can be true or false . For higher level non - direct
* sp , the pagetable is updated / synced via mmu_sync_children ( ) in
* FNAME ( fetch ) ( ) , so sp - > unsync_children can only be false .
* WARN_ON_ONCE ( ) if anything happens unexpectedly .
*/
if ( WARN_ON_ONCE ( sp - > unsync_children ) | | sp - > unsync )
2015-11-26 21:14:34 +09:00
mark_unsync ( sptep ) ;
2010-07-13 14:27:04 +03:00
}
2022-06-22 15:27:02 -04:00
static void link_shadow_page ( struct kvm_vcpu * vcpu , u64 * sptep ,
struct kvm_mmu_page * sp )
{
2022-06-22 15:27:10 -04:00
__link_shadow_page ( vcpu - > kvm , & vcpu - > arch . mmu_pte_list_desc_cache , sptep , sp , true ) ;
2022-06-22 15:27:02 -04:00
}
2010-07-13 14:27:07 +03:00
static void validate_direct_spte ( struct kvm_vcpu * vcpu , u64 * sptep ,
unsigned direct_access )
{
if ( is_shadow_present_pte ( * sptep ) & & ! is_large_pte ( * sptep ) ) {
struct kvm_mmu_page * child ;
/*
* For the direct sp , if the guest pte ' s dirty bit
* changed form clean to dirty , it will corrupt the
* sp ' s access : allow writable in the read - only sp ,
* so we should update the spte at this point to get
* a new sp with the correct access .
*/
2022-10-19 16:56:16 +00:00
child = spte_to_child_sp ( * sptep ) ;
2010-07-13 14:27:07 +03:00
if ( child - > role . access = = direct_access )
return ;
2023-07-28 17:47:21 -07:00
drop_parent_pte ( vcpu - > kvm , child , sptep ) ;
2022-10-10 20:19:16 +08:00
kvm_flush_remote_tlbs_sptep ( vcpu - > kvm , sptep ) ;
2010-07-13 14:27:07 +03:00
}
}
2020-09-23 15:14:06 -07:00
/* Returns the number of zapped non-leaf child shadow pages. */
static int mmu_page_zap_pte ( struct kvm * kvm , struct kvm_mmu_page * sp ,
u64 * spte , struct list_head * invalid_list )
2011-05-15 23:27:52 +08:00
{
u64 pte ;
struct kvm_mmu_page * child ;
pte = * spte ;
if ( is_shadow_present_pte ( pte ) ) {
2011-09-22 16:56:06 +08:00
if ( is_last_spte ( pte , sp - > role . level ) ) {
2011-07-12 03:28:04 +08:00
drop_spte ( kvm , spte ) ;
2011-09-22 16:56:06 +08:00
} else {
2022-10-19 16:56:16 +00:00
child = spte_to_child_sp ( pte ) ;
2023-07-28 17:47:21 -07:00
drop_parent_pte ( kvm , child , spte ) ;
2020-09-23 15:14:06 -07:00
/*
* Recursively zap nested TDP SPs , parentless SPs are
* unlikely to be used again in the near future . This
* avoids retaining a large number of stale nested SPs .
*/
if ( tdp_enabled & & invalid_list & &
child - > role . guest_mode & & ! child - > parent_ptes . val )
return kvm_mmu_prepare_zap_page ( kvm , child ,
invalid_list ) ;
2011-05-15 23:27:52 +08:00
}
2020-09-23 15:14:05 -07:00
} else if ( is_mmio_spte ( pte ) ) {
2011-07-12 03:33:44 +08:00
mmu_spte_clear_no_track ( spte ) ;
2020-09-23 15:14:05 -07:00
}
2020-09-23 15:14:06 -07:00
return 0 ;
2011-05-15 23:27:52 +08:00
}
2020-09-23 15:14:06 -07:00
static int kvm_mmu_page_unlink_children ( struct kvm * kvm ,
struct kvm_mmu_page * sp ,
struct list_head * invalid_list )
2007-01-05 16:36:45 -08:00
{
2020-09-23 15:14:06 -07:00
int zapped = 0 ;
2007-01-05 16:36:46 -08:00
unsigned i ;
2022-06-14 23:33:25 +00:00
for ( i = 0 ; i < SPTE_ENT_PER_PAGE ; + + i )
2020-09-23 15:14:06 -07:00
zapped + = mmu_page_zap_pte ( kvm , sp , sp - > spt + i , invalid_list ) ;
return zapped ;
2007-01-05 16:36:45 -08:00
}
2023-07-28 17:47:21 -07:00
static void kvm_mmu_unlink_parents ( struct kvm * kvm , struct kvm_mmu_page * sp )
2007-01-05 16:36:45 -08:00
{
2012-03-21 23:50:34 +09:00
u64 * sptep ;
struct rmap_iterator iter ;
2007-01-05 16:36:45 -08:00
2015-11-20 17:41:28 +09:00
while ( ( sptep = rmap_get_first ( & sp - > parent_ptes , & iter ) ) )
2023-07-28 17:47:21 -07:00
drop_parent_pte ( kvm , sp , sptep ) ;
2008-07-11 17:59:46 +03:00
}
2008-12-01 22:32:02 -02:00
static int mmu_zap_unsync_children(struct kvm *kvm,
2010-06-04 21:53:54 +08:00
				   struct kvm_mmu_page *parent,
				   struct list_head *invalid_list)
2008-09-23 13:18:39 -03:00
{
2008-12-01 22:32:02 -02:00
	int i, zapped = 0;
	struct mmu_page_path parents;
	struct kvm_mmu_pages pages;
2008-09-23 13:18:39 -03:00
2020-04-27 17:54:22 -07:00
	if (parent->role.level == PG_LEVEL_4K)
2008-09-23 13:18:39 -03:00
		return 0;
2008-12-01 22:32:02 -02:00
	while (mmu_unsync_walk(parent, &pages)) {
		struct kvm_mmu_page *sp;
		for_each_sp(pages, sp, parents, i) {
2010-06-04 21:53:54 +08:00
			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
2008-12-01 22:32:02 -02:00
			mmu_pages_clear_parents(&parents);
2010-04-16 16:34:42 +08:00
			zapped++;
2008-12-01 22:32:02 -02:00
		}
	}
	return zapped;
2008-09-23 13:18:39 -03:00
}
KVM: x86/mmu: Differentiate between nr zapped and list unstable
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information. 1) was at
least one page zapped and 2) has the list of MMU pages become unstable.
In its original incarnation (as kvm_mmu_zap_page()), there was no
return value at all. Commit 0738541396be ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children. Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.
Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped. This was not intentional, it was simply a
side effect of how the code was written.
The unintended side effect was then morphed into an actual feature by
commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.
Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect. Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.
Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter. Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.
Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 13:01:35 -08:00
static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
				       struct kvm_mmu_page *sp,
				       struct list_head *invalid_list,
				       int *nr_zapped)
2008-07-11 17:59:46 +03:00
{
2022-02-25 18:22:45 +00:00
	bool list_unstable, zapped_root = false;
2009-07-06 15:58:14 +03:00
2022-11-23 14:36:00 -05:00
	lockdep_assert_held_write(&kvm->mmu_lock);
2010-06-04 21:53:54 +08:00
	trace_kvm_mmu_prepare_zap_page(sp);
2008-07-11 17:59:46 +03:00
	++kvm->stat.mmu_shadow_zapped;
2019-02-05 13:01:35 -08:00
	*nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
2020-09-23 15:14:06 -07:00
	*nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
2023-07-28 17:47:21 -07:00
	kvm_mmu_unlink_parents(kvm, sp);
2013-05-31 08:36:22 +08:00
2019-02-05 13:01:35 -08:00
	/* Zapping children means active_mmu_pages has become unstable. */
	list_unstable = *nr_zapped;
2022-04-20 21:12:03 +08:00
	if (!sp->role.invalid && sp_has_gptes(sp))
2015-05-19 16:29:22 +02:00
		unaccount_shadowed(kvm, sp);
2013-05-31 08:36:22 +08:00
2008-09-23 13:18:39 -03:00
	if (sp->unsync)
		kvm_unlink_unsync_page(kvm, sp);
2007-11-21 15:28:32 +02:00
	if (!sp->root_count) {
2010-05-05 09:03:49 +08:00
		/* Count self */
2019-02-05 13:01:35 -08:00
		(*nr_zapped)++;
2020-06-23 12:35:39 -07:00
		/*
		 * Already invalid pages (previously active roots) are not on
		 * the active page list. See list_del() in the "else" case of
		 * !sp->root_count.
		 */
		if (sp->role.invalid)
			list_add(&sp->link, invalid_list);
		else
			list_move(&sp->link, invalid_list);
2022-08-23 00:46:37 +00:00
		kvm_unaccount_mmu_page(kvm, sp);
2008-02-20 14:47:24 -05:00
	} else {
2020-06-23 12:35:39 -07:00
		/*
		 * Remove the active root from the active page list, the root
		 * will be explicitly freed when the root_count hits zero.
		 */
		list_del(&sp->link);
2013-05-31 08:36:30 +08:00
2019-09-12 19:46:10 -07:00
		/*
		 * Obsolete pages cannot be used on any vCPUs, see the comment
		 * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
		 * treats invalid shadow pages as being obsolete.
		 */
2022-02-25 18:22:45 +00:00
		zapped_root = !is_obsolete_sp(kvm, sp);
2008-02-20 14:47:24 -05:00
	}
2010-06-04 21:53:54 +08:00
2022-10-19 16:56:12 +00:00
	if (sp->nx_huge_page_disallowed)
		unaccount_nx_huge_page(kvm, sp);
2019-11-04 12:22:02 +01:00
2010-06-04 21:53:54 +08:00
	sp->role.invalid = 1;
2022-02-25 18:22:45 +00:00
	/*
	 * Make the request to free obsolete roots after marking the root
	 * invalid, otherwise other vCPUs may not see it as invalid.
	 */
	if (zapped_root)
		kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
2019-02-05 13:01:35 -08:00
	return list_unstable;
}
static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
				     struct list_head *invalid_list)
{
	int nr_zapped;
	__kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
	return nr_zapped;
2007-01-05 16:36:45 -08:00
}
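The split described in the commit message above, a boolean "list may be unstable" return value plus a separate count of zapped pages, can be sketched in plain C. This is a minimal userspace model, not the kernel implementation: `struct toy_page`, `toy_prepare_zap()` and `toy_zap()` are hypothetical names, and the "children" are reduced to a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shadow page; not the kernel's struct. */
struct toy_page {
	int nr_children;	/* children that get zapped along with the page */
	bool is_root;		/* active roots are not counted as zapped here */
};

/*
 * Returns true if zapping touched other list entries (children), i.e. a
 * caller walking the list must restart. The number of pages zapped,
 * including the page itself, is reported through *nr_zapped.
 */
static bool toy_prepare_zap(const struct toy_page *page, int *nr_zapped)
{
	bool list_unstable;

	*nr_zapped = page->nr_children;
	list_unstable = *nr_zapped != 0;

	if (!page->is_root)
		(*nr_zapped)++;	/* count self */

	return list_unstable;
}

/* Wrapper for callers that only want to know whether anything was zapped. */
static bool toy_zap(const struct toy_page *page)
{
	int nr_zapped;

	toy_prepare_zap(page, &nr_zapped);
	return nr_zapped != 0;
}
```

With this shape, zapping a childless page reports one zapped page but leaves the list stable, so a walker has no reason to restart.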
2010-06-04 21:53:54 +08:00
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
				    struct list_head *invalid_list)
{
2013-03-06 16:05:52 +09:00
	struct kvm_mmu_page *sp, *nsp;
2010-06-04 21:53:54 +08:00
	if (list_empty(invalid_list))
		return;
2012-05-14 15:44:06 +03:00
	/*
2016-03-13 11:10:24 +08:00
	 * We need to make sure everyone sees our modifications to
	 * the page tables and see changes to vcpu->mode here. The barrier
	 * in the kvm_flush_remote_tlbs() achieves this. This pairs
	 * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.
	 *
	 * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit
	 * guest mode and/or lockless shadow page table walks.
2012-05-14 15:44:06 +03:00
	 */
	kvm_flush_remote_tlbs(kvm);
2011-07-12 03:32:13 +08:00
2013-03-06 16:05:52 +09:00
	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
		WARN_ON_ONCE(!sp->role.invalid || sp->root_count);
2022-06-22 15:26:55 -04:00
		kvm_mmu_free_shadow_page(sp);
2013-03-06 16:05:52 +09:00
	}
2010-06-04 21:53:54 +08:00
}
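The "warn once" behaviour discussed in the commit message above can be illustrated with a small userspace macro. This is only a sketch under stated assumptions: `TOY_WARN_ON_ONCE`, `toy_warn_count` and `toy_run()` are hypothetical names, the macro relies on GNU C statement expressions (a GCC/Clang extension), and it makes no attempt to be thread safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int toy_warn_count;	/* hypothetical counter, for demonstration only */

/*
 * Evaluates the condition every time, but reports at most once per call
 * site. The per-site state lives in a function-local static variable.
 */
#define TOY_WARN_ON_ONCE(cond) ({				\
	static bool toy_warned;					\
	bool toy_failed = !!(cond);				\
								\
	if (toy_failed && !toy_warned) {			\
		toy_warned = true;				\
		toy_warn_count++;				\
		fprintf(stderr, "warning: %s\n", #cond);	\
	}							\
	toy_failed;						\
})

/* Returns how many of n iterations hit the (always true) condition. */
static int toy_run(int n)
{
	int i, hits = 0;

	for (i = 0; i < n; i++)
		hits += TOY_WARN_ON_ONCE(i >= 0);

	return hits;
}
```

The condition still evaluates to true on every iteration, so callers can keep acting on the result, while the log receives a single message instead of one per iteration.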
2020-06-23 12:35:40 -07:00
static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
						  unsigned long nr_to_zap)
2013-03-06 16:06:58 +09:00
{
2020-06-23 12:35:40 -07:00
	unsigned long total_zapped = 0;
	struct kvm_mmu_page *sp, *tmp;
2019-12-06 15:57:15 -08:00
	LIST_HEAD(invalid_list);
2020-06-23 12:35:40 -07:00
	bool unstable;
	int nr_zapped;
2013-03-06 16:06:58 +09:00
	if (list_empty(&kvm->arch.active_mmu_pages))
2019-12-06 15:57:15 -08:00
		return 0;
2020-06-23 12:35:40 -07:00
restart:
2021-01-13 12:50:30 -08:00
	list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) {
2020-06-23 12:35:40 -07:00
		/*
		 * Don't zap active root pages, the page itself can't be freed
		 * and zapping it will just force vCPUs to realloc and reload.
		 */
		if (sp->root_count)
			continue;
		unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
						      &nr_zapped);
		total_zapped += nr_zapped;
		if (total_zapped >= nr_to_zap)
2019-12-06 15:57:15 -08:00
			break;
2020-06-23 12:35:40 -07:00
		if (unstable)
			goto restart;
2019-12-06 15:57:15 -08:00
	}
2013-03-06 16:06:58 +09:00
2020-06-23 12:35:40 -07:00
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	kvm->stat.mmu_recycled += total_zapped;
	return total_zapped;
}
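The restart-on-unstable walk used by kvm_mmu_zap_oldest_mmu_pages() can be modelled with a hand-rolled circular list. This is a simplified sketch, not kernel code: `struct toy_node` and the `toy_*` helpers are hypothetical, each node has at most one "child" that is removed together with it, and no locking is modelled. The point it illustrates is that a "safe" iterator only protects the saved neighbour pointer when just the current entry is removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *prev, *next;
	struct toy_node *child;	/* removed together with this node, may be NULL */
	bool on_list;
};

static void toy_list_init(struct toy_node *head)
{
	head->prev = head->next = head;
}

/* Insert at the front, so the tail is always the oldest entry. */
static void toy_list_add(struct toy_node *head, struct toy_node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
	n->on_list = true;
}

static void toy_list_del(struct toy_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
	n->on_list = false;
}

/* Unlinks n and its child; returns true if another entry was removed. */
static bool toy_zap(struct toy_node *n, int *nr_zapped)
{
	bool unstable = false;

	*nr_zapped = 0;
	if (n->child && n->child->on_list) {
		toy_list_del(n->child);
		(*nr_zapped)++;
		unstable = true;
	}
	toy_list_del(n);
	(*nr_zapped)++;
	return unstable;
}

/* Walk from the oldest entry, restarting whenever the saved pointer may be stale. */
static unsigned long toy_zap_oldest(struct toy_node *head, unsigned long nr_to_zap)
{
	unsigned long total = 0;
	struct toy_node *n, *tmp;
	bool unstable;
	int nr_zapped;

restart:
	for (n = head->prev, tmp = n->prev; n != head; n = tmp, tmp = n->prev) {
		unstable = toy_zap(n, &nr_zapped);
		total += nr_zapped;
		if (total >= nr_to_zap)
			break;
		if (unstable)
			goto restart;
	}
	return total;
}
```

In this model, if the removed child happens to be the saved `tmp` entry, continuing the loop without the restart would dereference a node that is no longer linked.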
2020-06-22 13:20:30 -07:00
static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
{
	if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
		return kvm->arch.n_max_mmu_pages -
			kvm->arch.n_used_mmu_pages;
	return 0;
2013-03-06 16:06:58 +09:00
}
2019-12-06 15:57:15 -08:00
static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
{
2020-06-23 12:35:40 -07:00
	unsigned long avail = kvm_mmu_available_pages(vcpu->kvm);
2019-12-06 15:57:15 -08:00
2020-06-23 12:35:40 -07:00
	if (likely(avail >= KVM_MIN_FREE_MMU_PAGES))
2019-12-06 15:57:15 -08:00
		return 0;
2020-06-23 12:35:40 -07:00
	kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail);
2019-12-06 15:57:15 -08:00
2021-03-04 17:10:50 -08:00
	/*
	 * Note, this check is intentionally soft, it only guarantees that one
	 * page is available, while the caller may end up allocating as many as
	 * four pages, e.g. for PAE roots or for 5-level paging. Temporarily
	 * exceeding the (arbitrary by default) limit will not harm the host,
2021-05-12 19:58:31 +02:00
	 * being too aggressive may unnecessarily kill the guest, and getting an
2021-03-04 17:10:50 -08:00
	 * exact count is far more trouble than it's worth, especially in the
	 * page fault paths.
	 */
2019-12-06 15:57:15 -08:00
	if (!kvm_mmu_available_pages(vcpu->kvm))
		return -ENOSPC;
	return 0;
}
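The accounting in kvm_mmu_available_pages() and make_mmu_pages_available() boils down to a saturating subtraction followed by a deliberately soft check. The sketch below models that arithmetic in userspace. `struct toy_vm`, the `toy_*` functions and the two threshold values are assumptions made for illustration, and the reclaim step simply decrements a counter instead of zapping pages.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical thresholds; the real constants are defined outside this section. */
#define TOY_MIN_FREE_PAGES	5UL
#define TOY_REFILL_PAGES	25UL

struct toy_vm {
	unsigned long n_max_pages;
	unsigned long n_used_pages;
};

/* Saturating subtraction: never wraps when usage exceeds the limit. */
static unsigned long toy_available_pages(const struct toy_vm *vm)
{
	if (vm->n_max_pages > vm->n_used_pages)
		return vm->n_max_pages - vm->n_used_pages;
	return 0;
}

/* Stand-in for zapping: frees up to nr pages and returns how many. */
static unsigned long toy_reclaim(struct toy_vm *vm, unsigned long nr)
{
	unsigned long freed = nr < vm->n_used_pages ? nr : vm->n_used_pages;

	vm->n_used_pages -= freed;
	return freed;
}

static int toy_make_pages_available(struct toy_vm *vm)
{
	unsigned long avail = toy_available_pages(vm);

	if (avail >= TOY_MIN_FREE_PAGES)
		return 0;

	toy_reclaim(vm, TOY_REFILL_PAGES - avail);

	/* Soft check: success only guarantees that one page is free. */
	if (!toy_available_pages(vm))
		return -ENOSPC;
	return 0;
}
```

Because the subtraction saturates at zero, a VM that is temporarily over its limit reports no available pages rather than a huge wrapped-around value.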
2007-10-02 18:52:55 +02:00
/*
 * Changing the number of mmu pages allocated to the vm
2010-08-19 18:11:28 -07:00
 * Note: if goal_nr_mmu_pages is too small, you will get a deadlock
2007-10-02 18:52:55 +02:00
 */
2019-04-08 11:07:30 -07:00
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
2007-10-02 18:52:55 +02:00
{
2021-02-02 10:57:24 -08:00
	write_lock(&kvm->mmu_lock);
2013-01-08 19:46:07 +09:00
2010-08-19 18:11:28 -07:00
	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
2020-06-23 12:35:40 -07:00
		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
						  goal_nr_mmu_pages);
2007-10-02 18:52:55 +02:00
2010-08-19 18:11:28 -07:00
		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
2007-10-02 18:52:55 +02:00
	}
2010-08-19 18:11:28 -07:00
	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
2013-01-08 19:46:07 +09:00
2021-02-02 10:57:24 -08:00
	write_unlock(&kvm->mmu_lock);
2007-10-02 18:52:55 +02:00
}
2011-09-22 17:02:48 +08:00
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
2007-01-05 16:36:45 -08:00
{
2007-11-21 15:28:32 +02:00
	struct kvm_mmu_page *sp;
2010-06-04 21:55:29 +08:00
	LIST_HEAD(invalid_list);
2007-01-05 16:36:45 -08:00
	int r;
	r = 0;
2021-02-02 10:57:24 -08:00
	write_lock(&kvm->mmu_lock);
2022-04-20 21:12:03 +08:00
	for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
2010-06-04 21:53:07 +08:00
		r = 1;
2010-06-04 21:56:11 +08:00
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
2010-06-04 21:53:07 +08:00
	}
2010-06-04 21:55:29 +08:00
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
2021-02-02 10:57:24 -08:00
	write_unlock(&kvm->mmu_lock);
2011-09-22 17:02:48 +08:00
2007-01-05 16:36:45 -08:00
	return r;
[PATCH] KVM: MMU: Shadow page table caching
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.
The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
* we can cache real mode, 32-bit mode, pae, and long mode page
tables simultaneously. this is useful for smp bootup.
- the guest page table level
* some kernels use a page as both a page table and a page directory. this
allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
* 32-bit mode page tables span 4MB, whereas a shadow page table spans
2MB. similarly, a 32-bit page directory spans 4GB, while a shadow
page directory spans 1GB. the quadrant allows caching up to 4 shadow page
tables for one guest page in one level.
- a "metaphysical" bit
* for real mode, and for pse pages, there is no guest page table, so set
the bit to avoid write protecting the page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 16:36:43 -08:00
}
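The cache key listed in the "Shadow page table caching" commit message above can be pictured as a small struct plus a bucket function. The layout below is an illustrative assumption, not the kernel's actual role encoding: `struct toy_sp_key`, `TOY_HASH_BUCKETS` and both helpers are hypothetical, and choosing the bucket from the frame number alone is a simplification made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key layout, loosely following the list in the commit message. */
struct toy_sp_key {
	uint64_t gfn;		/* guest page table frame number */
	uint8_t guest_levels;	/* number of paging levels in the guest */
	uint8_t level;		/* level at which the guest page is used */
	uint8_t quadrant;	/* which part of a 32-bit guest table is shadowed */
	bool metaphysical;	/* no guest page table backs this shadow page */
};

#define TOY_HASH_BUCKETS 256u

/* Two shadow pages may share a cache entry only if every field matches. */
static bool toy_key_equal(const struct toy_sp_key *a, const struct toy_sp_key *b)
{
	return a->gfn == b->gfn &&
	       a->guest_levels == b->guest_levels &&
	       a->level == b->level &&
	       a->quadrant == b->quadrant &&
	       a->metaphysical == b->metaphysical;
}

/* Bucket selection by frame number; the other fields are compared on lookup. */
static unsigned int toy_key_bucket(const struct toy_sp_key *k)
{
	return (unsigned int)(k->gfn % TOY_HASH_BUCKETS);
}
```

Keys that differ only in level or quadrant land in the same bucket but compare unequal, which is how several shadow pages for one guest page can coexist in this model.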
2021-02-12 16:50:15 -08:00
static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
{
	gpa_t gpa;
	int r;
2022-02-10 08:00:56 -05:00
	if (vcpu->arch.mmu->root_role.direct)
2021-02-12 16:50:15 -08:00
		return 0;
	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
	return r;
}
2007-01-05 16:36:43 -08:00
2021-11-15 15:45:54 -08:00
static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
2010-05-24 15:40:07 +08:00
{
	trace_kvm_mmu_unsync_page(sp);
2021-11-15 15:45:54 -08:00
	++kvm->stat.mmu_unsync;
2010-05-24 15:40:07 +08:00
	sp->unsync = 1;
	kvm_mmu_mark_parents_unsync(sp);
}
2021-06-22 10:56:58 -07:00
/*
 * Attempt to unsync any shadow pages that can be reached by the specified gfn,
 * KVM is creating a writable mapping for said gfn. Returns 0 if all pages
 * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
 * be write-protected.
 */
2021-11-15 15:45:58 -08:00
int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
2021-09-29 09:19:32 -04:00
			    gfn_t gfn, bool can_unsync, bool prefetch)
2008-09-23 13:18:39 -03:00
{
2016-02-24 17:51:15 +08:00
	struct kvm_mmu_page *sp;
KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-12 11:18:15 -07:00
	bool locked = false;
2008-09-23 13:18:39 -03:00
2021-06-22 10:56:58 -07:00
	/*
	 * Force write-protection if the page is being tracked. Note, the page
	 * track machinery is used to write-protect upper-level shadow pages,
	 * i.e. this guards the role.level == 4K assertion below!
	 */
2023-07-28 18:35:30 -07:00
	if (kvm_gfn_is_write_tracked(kvm, slot, gfn))
2021-06-22 10:56:58 -07:00
		return -EPERM;
2010-05-24 15:40:07 +08:00
2021-06-22 10:56:58 -07:00
	/*
	 * The page is not write-tracked, mark existing shadow pages unsync
	 * unless KVM is synchronizing an unsync SP (can_unsync = false). In
	 * that case, KVM must complete emulation of the guest TLB flush before
	 * allowing shadow pages to become unsync (writable by the guest).
	 */
2022-04-20 21:12:03 +08:00
	for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
KVM: MMU: fix writable sync sp mapping
When many unsync SPs are synced at one time (in mmu_sync_children()),
KVM may map an spte writable. This is dangerous if the gfn mapped by one
unsync SP is also the gfn of another unsync page.
For example:
SP1.pte[0] = P
SP2.gfn's pfn = P
[SP1.pte[0] = SP2.gfn's pfn]
First, SP1 and SP2 are write-protected, but both are still unsync SPs.
Then SP1 is synced first. The sync detects that SP1.pte[0].gfn has only one
unsync SP, namely SP2, so it maps the gfn writable. But SP2 is about to be
synced as well, and at that point SP2->unsync is no longer reliable: SP2 is
synced later while SP2->gfn is already writable.
So the final result is: SP2 is a sync page but SP2.gfn is writable.
This bug corrupts the guest's page tables. Fix it by making the mapping
read-only if the mapped gfn has shadow pages.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-06-30 16:02:02 +08:00
if ( ! can_unsync )
2021-06-22 10:56:58 -07:00
return - EPERM ;
2010-06-30 16:02:02 +08:00
2016-02-24 17:51:15 +08:00
if ( sp - > unsync )
continue ;
2010-05-24 15:40:07 +08:00
2021-09-29 09:19:32 -04:00
if ( prefetch )
2021-09-18 08:56:35 +08:00
return - EEXIST ;
KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-12 11:18:15 -07:00
/*
* TDP MMU page faults require an additional spinlock as they
* run with mmu_lock held for read , not write , and the unsync
* logic is not thread safe . Take the spinlock regardless of
* the MMU type to avoid extra conditionals / parameters , there ' s
* no meaningful penalty if mmu_lock is held for write .
*/
if ( ! locked ) {
locked = true ;
2021-11-15 15:45:54 -08:00
spin_lock ( & kvm - > arch . mmu_unsync_pages_lock ) ;
2021-08-12 11:18:15 -07:00
/*
* Recheck after taking the spinlock , a different vCPU
* may have since marked the page unsync . A false
2023-11-25 03:34:00 -05:00
* negative on the unprotected check above is not
2021-08-12 11:18:15 -07:00
* possible as clearing sp - > unsync _must_ hold mmu_lock
2023-11-25 03:34:00 -05:00
* for write , i . e . unsync cannot transition from 1 - > 0
2021-08-12 11:18:15 -07:00
* while this CPU holds mmu_lock for read ( or write ) .
*/
if ( READ_ONCE ( sp - > unsync ) )
continue ;
}
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( sp - > role . level ! = PG_LEVEL_4K ) ;
2021-11-15 15:45:54 -08:00
kvm_unsync_page ( kvm , sp ) ;
2008-09-23 13:18:39 -03:00
}
2021-08-12 11:18:15 -07:00
if ( locked )
2021-11-15 15:45:54 -08:00
spin_unlock ( & kvm - > arch . mmu_unsync_pages_lock ) ;
2016-02-24 17:51:11 +08:00
2018-06-27 14:59:05 -07:00
/*
* We need to ensure that the marking of unsync pages is visible
* before the SPTE is updated to allow writes because
* kvm_mmu_sync_roots ( ) checks the unsync flags without holding
* the MMU lock and so can race with this . If the SPTE was updated
* before the page had been marked as unsync - ed , something like the
* following could happen :
*
* CPU 1 CPU 2
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* 1.2 Host updates SPTE
* to be writable
* 2.1 Guest writes a GPTE for GVA X .
* ( GPTE being in the guest page table shadowed
* by the SP from CPU 1. )
* This reads SPTE during the page table walk .
* Since SPTE . W is read as 1 , there is no
* fault .
*
* 2.2 Guest issues TLB flush .
* That causes a VM Exit .
*
2021-06-22 10:56:58 -07:00
* 2.3 Walking of unsync pages sees sp - > unsync is
* false and skips the page .
2018-06-27 14:59:05 -07:00
*
* 2.4 Guest accesses GVA X .
* Since the mapping in the SP was not updated ,
* the old mapping for GVA X incorrectly
* gets used .
* 1.1 Host marks SP
* as unsync
* ( sp - > unsync = true )
*
* The write barrier below ensures that 1.1 happens before 1.2 and thus
2021-10-19 19:01:53 +08:00
* the situation in 2.4 does not arise . It pairs with the read barrier
* in is_unsync_root ( ) , placed between 2.1 ' s load of SPTE . W and 2.3 .
2018-06-27 14:59:05 -07:00
*/
smp_wmb ( ) ;
2021-06-22 10:56:58 -07:00
return 0 ;
2008-09-23 13:18:39 -03:00
}
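The locking pattern in the function above (cheap unlocked check, take the lock only on first need, recheck once under the lock) can be illustrated outside the kernel. The following is a minimal userspace sketch, not KVM code: `struct page_state`, `mark_pages_unsync()`, and the `atomic_flag` spinlock standing in for `mmu_unsync_pages_lock` are hypothetical names chosen for illustration. It assumes, as the kernel comment does, that a flag is never cleared while a marker is running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shadow page; only the unsync flag matters. */
struct page_state {
	atomic_bool unsync;
};

/* Hypothetical spinlock standing in for mmu_unsync_pages_lock. */
static atomic_flag unsync_lock = ATOMIC_FLAG_INIT;

static void lock_unsync(void)
{
	while (atomic_flag_test_and_set_explicit(&unsync_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void unlock_unsync(void)
{
	atomic_flag_clear_explicit(&unsync_lock, memory_order_release);
}

/*
 * Mark every page in @pages unsync and return how many pages this call
 * moved from sync to unsync. The lock is taken at most once, and only if
 * some page looks sync on the unlocked check.
 */
static size_t mark_pages_unsync(struct page_state *pages, size_t n)
{
	bool locked = false;
	size_t marked = 0;

	for (size_t i = 0; i < n; i++) {
		/* Unlocked check: an unsync page needs no work. */
		if (atomic_load_explicit(&pages[i].unsync, memory_order_relaxed))
			continue;

		if (!locked) {
			locked = true;
			lock_unsync();

			/* Recheck: another thread may have marked it first. */
			if (atomic_load_explicit(&pages[i].unsync,
						 memory_order_relaxed))
				continue;
		}

		atomic_store_explicit(&pages[i].unsync, true,
				      memory_order_relaxed);
		marked++;
	}

	if (locked)
		unlock_unsync();

	return marked;
}
```

Once the lock is held, the check at the top of the loop is effectively a locked check, which is why the recheck is only needed on the iteration that acquires the lock.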
2021-08-13 20:35:03 +00:00
static int mmu_set_spte ( struct kvm_vcpu * vcpu , struct kvm_memory_slot * slot ,
u64 * sptep , unsigned int pte_access , gfn_t gfn ,
2021-08-17 07:49:47 -04:00
kvm_pfn_t pfn , struct kvm_page_fault * fault )
2008-09-23 13:18:30 -03:00
{
2021-08-17 07:22:32 -04:00
struct kvm_mmu_page * sp = sptep_to_sp ( sptep ) ;
2021-08-17 07:42:10 -04:00
int level = sp - > role . level ;
2008-09-23 13:18:30 -03:00
int was_rmapped = 0 ;
2020-09-23 15:04:24 -07:00
int ret = RET_PF_FIXED ;
2018-07-24 08:17:07 +00:00
bool flush = false ;
2021-08-17 07:32:09 -04:00
bool wrprot ;
2021-08-17 07:22:32 -04:00
u64 spte ;
2008-09-23 13:18:30 -03:00
2021-08-17 07:49:47 -04:00
/* Prefetching always gets a writable pfn. */
bool host_writable = ! fault | | fault - > map_writable ;
2021-09-29 09:19:32 -04:00
bool prefetch = ! fault | | fault - > prefetch ;
2021-08-17 07:49:47 -04:00
bool write_fault = fault & & fault - > write ;
2008-09-23 13:18:30 -03:00
2021-02-25 12:47:32 -08:00
if ( unlikely ( is_noslot_pfn ( pfn ) ) ) {
2022-04-23 03:47:49 +00:00
vcpu - > stat . pf_mmio_spte_created + + ;
2021-02-25 12:47:32 -08:00
mark_mmio_spte ( vcpu , sptep , gfn , pte_access ) ;
return RET_PF_EMULATE ;
}
2015-11-20 17:44:55 +09:00
if ( is_shadow_present_pte ( * sptep ) ) {
2008-09-23 13:18:30 -03:00
/*
* If we overwrite a PTE page pointer with a 2 MB PMD , unlink
* the parent of the now unreachable PTE .
*/
2020-04-27 17:54:22 -07:00
if ( level > PG_LEVEL_4K & & ! is_large_pte ( * sptep ) ) {
2008-09-23 13:18:30 -03:00
struct kvm_mmu_page * child ;
2009-06-10 14:24:23 +03:00
u64 pte = * sptep ;
2008-09-23 13:18:30 -03:00
2022-10-19 16:56:16 +00:00
child = spte_to_child_sp ( pte ) ;
2023-07-28 17:47:21 -07:00
drop_parent_pte ( vcpu - > kvm , child , sptep ) ;
2018-07-24 08:17:07 +00:00
flush = true ;
2009-06-10 14:24:23 +03:00
} else if ( pfn ! = spte_to_pfn ( * sptep ) ) {
2011-07-12 03:28:04 +08:00
drop_spte ( vcpu - > kvm , sptep ) ;
2018-07-24 08:17:07 +00:00
flush = true ;
2009-02-18 14:08:59 +01:00
} else
was_rmapped = 1 ;
2008-09-23 13:18:30 -03:00
}
2009-07-27 16:30:44 +02:00
2021-09-29 09:19:32 -04:00
wrprot = make_spte ( vcpu , sp , slot , pte_access , gfn , pfn , * sptep , prefetch ,
2021-08-17 07:43:19 -04:00
true , host_writable , & spte ) ;
2021-08-17 07:22:32 -04:00
if ( * sptep = = spte ) {
ret = RET_PF_SPURIOUS ;
} else {
flush | = mmu_spte_update ( sptep , spte ) ;
2022-03-02 12:24:57 +02:00
trace_kvm_mmu_set_spte ( level , gfn , sptep ) ;
2021-08-17 07:22:32 -04:00
}
2021-08-17 07:32:09 -04:00
if ( wrprot ) {
2008-09-23 13:18:30 -03:00
if ( write_fault )
2017-08-17 15:03:32 +02:00
ret = RET_PF_EMULATE ;
2008-09-23 13:18:31 -03:00
}
2018-12-06 21:21:09 +08:00
2021-08-17 07:22:32 -04:00
if ( flush )
2022-10-10 20:19:17 +08:00
kvm_flush_remote_tlbs_gfn ( vcpu - > kvm , gfn , level ) ;
2008-09-23 13:18:30 -03:00
2021-08-02 21:46:05 -07:00
if ( ! was_rmapped ) {
2021-08-17 07:22:32 -04:00
WARN_ON_ONCE ( ret = = RET_PF_SPURIOUS ) ;
KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.
In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.
While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:04 -04:00
rmap_add ( vcpu , slot , sptep , gfn , pte_access ) ;
} else {
/* Already rmapped but the pte_access bits may have changed. */
2022-07-12 02:07:22 +00:00
kvm_mmu_page_set_access ( sp , spte_index ( sptep ) , pte_access ) ;
2007-12-09 17:40:31 +02:00
}
2012-08-03 15:42:10 +08:00
2017-08-17 15:03:32 +02:00
return ret ;
2007-12-09 17:40:31 +02:00
}
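The way mmu_set_spte() arrives at its return value can be summarized in isolation. The sketch below is an illustration only: `enum pf_result` and `classify_spte_update()` are hypothetical names that mirror the RET_PF_* codes above, and the function takes the old and new SPTE values as plain integers instead of touching any page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes mirroring the RET_PF_* names used above. */
enum pf_result { PF_FIXED, PF_EMULATE, PF_SPURIOUS };

/*
 * Decide the fault result from the current and newly built SPTE values.
 * The default is "fixed"; an unchanged SPTE makes the fault spurious; a
 * write fault on a mapping that had to be write-protected must be emulated,
 * and that case overrides the other two.
 */
static enum pf_result classify_spte_update(uint64_t old_spte, uint64_t new_spte,
					   bool wrprot, bool write_fault)
{
	enum pf_result ret = PF_FIXED;

	if (old_spte == new_spte)
		ret = PF_SPURIOUS;

	if (wrprot && write_fault)
		ret = PF_EMULATE;

	return ret;
}
```

The ordering matters: emulation is checked last, so a write fault on a write-protected mapping is emulated even when the SPTE itself did not change.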
2010-08-22 19:12:48 +08:00
static int direct_pte_prefetch_many ( struct kvm_vcpu * vcpu ,
struct kvm_mmu_page * sp ,
u64 * start , u64 * end )
{
struct page * pages [ PTE_PREFETCH_NUM ] ;
2015-05-19 16:01:50 +02:00
struct kvm_memory_slot * slot ;
2020-02-03 15:09:09 -08:00
unsigned int access = sp - > role . access ;
2010-08-22 19:12:48 +08:00
int i , ret ;
gfn_t gfn ;
2022-07-12 02:07:22 +00:00
gfn = kvm_mmu_page_get_gfn ( sp , spte_index ( start ) ) ;
2015-05-19 16:01:50 +02:00
slot = gfn_to_memslot_dirty_bitmap ( vcpu , gfn , access & ACC_WRITE_MASK ) ;
if ( ! slot )
2010-08-22 19:12:48 +08:00
return - 1 ;
2015-05-19 16:01:50 +02:00
ret = gfn_to_page_many_atomic ( slot , gfn , pages , end - start ) ;
2010-08-22 19:12:48 +08:00
if ( ret < = 0 )
return - 1 ;
2019-01-03 16:22:21 -08:00
for ( i = 0 ; i < ret ; i + + , gfn + + , start + + ) {
2021-08-13 20:35:03 +00:00
mmu_set_spte ( vcpu , slot , start , access , gfn ,
2021-08-17 07:49:47 -04:00
page_to_pfn ( pages [ i ] ) , NULL ) ;
2019-01-03 16:22:21 -08:00
put_page ( pages [ i ] ) ;
}
2010-08-22 19:12:48 +08:00
return 0 ;
}
static void __direct_pte_prefetch ( struct kvm_vcpu * vcpu ,
struct kvm_mmu_page * sp , u64 * sptep )
{
u64 * spte , * start = NULL ;
int i ;
2023-07-28 17:47:17 -07:00
WARN_ON_ONCE ( ! sp - > role . direct ) ;
2010-08-22 19:12:48 +08:00
2022-07-12 02:07:22 +00:00
i = spte_index ( sptep ) & ~ ( PTE_PREFETCH_NUM - 1 ) ;
2010-08-22 19:12:48 +08:00
spte = sp - > spt + i ;
for ( i = 0 ; i < PTE_PREFETCH_NUM ; i + + , spte + + ) {
2011-07-12 03:28:04 +08:00
if ( is_shadow_present_pte ( * spte ) | | spte = = sptep ) {
2010-08-22 19:12:48 +08:00
if ( ! start )
continue ;
if ( direct_pte_prefetch_many ( vcpu , sp , start , spte ) < 0 )
2021-08-18 23:56:15 +00:00
return ;
2010-08-22 19:12:48 +08:00
start = NULL ;
} else if ( ! start )
start = spte ;
}
2021-08-18 23:56:15 +00:00
if ( start )
direct_pte_prefetch_many ( vcpu , sp , start , spte ) ;
2010-08-22 19:12:48 +08:00
}
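The scan in __direct_pte_prefetch() aligns the faulting index down to a prefetch window and then batches consecutive non-present entries, skipping the faulting entry itself. A minimal sketch of that scan follows. It is an illustration under stated assumptions: `PREFETCH_NUM` is assumed to be 8, a zero entry stands for "not present", and `struct run` and `collect_prefetch_runs()` are hypothetical names. Unlike the kernel function, it only records the runs and never stops early on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_NUM 8	/* assumed window size; must be a power of two */

/* One maximal run of consecutive non-present entries. */
struct run {
	size_t start;	/* index of the first entry in the run */
	size_t len;	/* number of entries in the run */
};

/*
 * Scan the PREFETCH_NUM-aligned window containing @fault_idx and record
 * each maximal run of zero entries, treating the faulting entry as a
 * boundary. Returns the number of runs stored in @out, which needs room
 * for PREFETCH_NUM / 2 runs.
 */
static size_t collect_prefetch_runs(const uint64_t *table, size_t fault_idx,
				    struct run *out)
{
	size_t base = fault_idx & ~(size_t)(PREFETCH_NUM - 1);
	size_t start = 0;
	size_t nr = 0;
	int have_start = 0;

	for (size_t i = base; i < base + PREFETCH_NUM; i++) {
		if (table[i] != 0 || i == fault_idx) {
			/* Boundary: close the open run, if any. */
			if (!have_start)
				continue;
			out[nr++] = (struct run){ start, i - start };
			have_start = 0;
		} else if (!have_start) {
			/* First non-present entry of a new run. */
			start = i;
			have_start = 1;
		}
	}

	/* A run that reaches the end of the window is closed here. */
	if (have_start)
		out[nr++] = (struct run){ start, base + PREFETCH_NUM - start };

	return nr;
}
```

Masking with `~(PREFETCH_NUM - 1)` only rounds down correctly when the window size is a power of two, which is why the sketch states that as a requirement.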
static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
{
	struct kvm_mmu_page *sp;
2020-06-22 13:20:33 -07:00
	sp = sptep_to_sp(sptep);
2017-06-30 17:26:31 -07:00
2010-08-22 19:12:48 +08:00
	/*
2017-06-30 17:26:31 -07:00
	 * Without accessed bits, there's no way to distinguish between
	 * actually accessed translations and prefetched, so disable pte
	 * prefetch if accessed bits aren't available.
2010-08-22 19:12:48 +08:00
	 */
2017-06-30 17:26:31 -07:00
	if (sp_ad_disabled(sp))
2010-08-22 19:12:48 +08:00
		return;
2020-04-27 17:54:22 -07:00
	if (sp->role.level > PG_LEVEL_4K)
2010-08-22 19:12:48 +08:00
		return;
2021-02-22 11:45:22 +09:00
	/*
	 * If addresses are being invalidated, skip prefetching to avoid
	 * accidentally prefetching those addresses.
	 */
2022-08-16 20:53:22 +08:00
	if (unlikely(vcpu->kvm->mmu_invalidate_in_progress))
2021-02-22 11:45:22 +09:00
		return;
2010-08-22 19:12:48 +08:00
	__direct_pte_prefetch(vcpu, sp, sptep);
}
2022-07-15 23:21:05 +00:00
/*
 * Lookup the mapping level for @gfn in the current mm.
 *
 * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
 * consumer to be tied into KVM's handlers for MMU notifier events!
 *
 * There are several ways to safely use this helper:
 *
2023-10-27 11:21:45 -07:00
 * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
2022-07-15 23:21:05 +00:00
 *   consuming it. In this case, mmu_lock doesn't need to be held during the
 *   lookup, but it does need to be held while checking the MMU notifier.
 *
 * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
 *   event for the hva. This can be done by explicitly checking the MMU notifier
 *   or by ensuring that KVM already has a valid mapping that covers the hva.
 *
 * - Do not use the result to install new mappings, e.g. use the host mapping
 *   level only to decide whether or not to zap an entry. In this case, it's
 *   not required to hold mmu_lock (though it's highly likely the caller will
 *   want to hold mmu_lock anyways, e.g. to modify SPTEs).
 *
 * Note! The lookup can still race with modifications to host page tables, but
 * the above "rules" ensure KVM will not _consume_ the result of the walk if a
 * race with the primary MMU occurs.
 */
2022-07-15 23:21:04 +00:00
static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
2021-04-01 16:37:24 -07:00
				  const struct kvm_memory_slot *slot)
2020-01-08 12:24:41 -08:00
{
2022-04-29 01:04:14 +00:00
	int level = PG_LEVEL_4K;
2020-01-08 12:24:41 -08:00
	unsigned long hva;
2022-04-29 03:17:57 +00:00
	unsigned long flags;
	pgd_t pgd;
	p4d_t p4d;
	pud_t pud;
	pmd_t pmd;
2020-01-08 12:24:41 -08:00
2020-01-08 12:24:46 -08:00
	/*
	 * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
	 * is not solely for performance, it's also necessary to avoid the
	 * "writable" check in __gfn_to_hva_many(), which will always fail on
	 * read-only memslots due to gfn_to_hva() assuming writes. Earlier
	 * page fault steps have already verified the guest isn't writing a
	 * read-only memslot.
	 */
2020-01-08 12:24:41 -08:00
	hva = __gfn_to_hva_memslot(slot, gfn);
2022-04-29 03:17:57 +00:00
	/*
2022-07-15 23:21:05 +00:00
	 * Disable IRQs to prevent concurrent tear down of host page tables,
	 * e.g. if the primary MMU promotes a P*D to a huge page and then frees
	 * the original page table.
2022-04-29 03:17:57 +00:00
	 */
	local_irq_save(flags);
2022-07-15 23:21:05 +00:00
	/*
	 * Read each entry once. As above, a non-leaf entry can be promoted to
	 * a huge page _during_ this walk. Re-reading the entry could send the
2024-03-05 12:37:49 +08:00
	 * walk into the weeds, e.g. p*d_leaf() returns false (sees the old
2022-07-15 23:21:05 +00:00
	 * value) and then p*d_offset() walks into the target huge page instead
	 * of the old page table (sees the new value).
	 */
2022-04-29 03:17:57 +00:00
	pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
	if (pgd_none(pgd))
		goto out;
	p4d = READ_ONCE(*p4d_offset(&pgd, hva));
	if (p4d_none(p4d) || !p4d_present(p4d))
		goto out;
2020-01-08 12:24:41 -08:00
2022-04-29 03:17:57 +00:00
	pud = READ_ONCE(*pud_offset(&p4d, hva));
	if (pud_none(pud) || !pud_present(pud))
		goto out;
2024-03-05 12:37:48 +08:00
	if (pud_leaf(pud)) {
2022-04-29 03:17:57 +00:00
		level = PG_LEVEL_1G;
		goto out;
	}
	pmd = READ_ONCE(*pmd_offset(&pud, hva));
	if (pmd_none(pmd) || !pmd_present(pmd))
		goto out;
2024-03-05 12:37:47 +08:00
	if (pmd_leaf(pmd))
2022-04-29 03:17:57 +00:00
		level = PG_LEVEL_2M;
out:
	local_irq_restore(flags);
2020-01-08 12:24:41 -08:00
	return level;
}
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 11:22:02 -07:00
static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
				       const struct kvm_memory_slot *slot,
				       gfn_t gfn, int max_level, bool is_private)
2021-02-12 16:50:04 -08:00
{
	struct kvm_lpage_info *linfo;
2021-08-06 07:05:58 -04:00
	int host_level;
2021-02-12 16:50:04 -08:00
	max_level = min(max_level, max_huge_page_level);
	for ( ; max_level > PG_LEVEL_4K; max_level--) {
		linfo = lpage_info_slot(gfn, slot, max_level);
		if (!linfo->disallow_lpage)
			break;
	}
2023-10-27 11:22:02 -07:00
	if (is_private)
		return max_level;
2021-02-12 16:50:04 -08:00
	if (max_level == PG_LEVEL_4K)
		return PG_LEVEL_4K;
2022-07-15 23:21:04 +00:00
	host_level = host_pfn_mapping_level(kvm, gfn, slot);
2021-08-06 07:05:58 -04:00
	return min(host_level, max_level);
2021-02-12 16:50:04 -08:00
}
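The level selection in __kvm_mmu_max_mapping_level() can be modelled in user space as follows. This is a minimal sketch under stated assumptions: the LVL_* constants, max_mapping_level_sketch() and the disallow_lpage array indexed by level are hypothetical stand-ins for PG_LEVEL_*, the real function and the per-slot lpage_info, and the host level is passed in rather than walked from page tables.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed level numbering for this sketch (4K lowest, 1G highest). */
enum { LVL_4K = 1, LVL_2M = 2, LVL_1G = 3 };

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Clamp the requested level to the largest supported huge page size, walk
 * down past levels where huge pages are disallowed, and, for non-private
 * memory only, clamp again to the level used by the host mapping.
 */
static int max_mapping_level_sketch(int max_level, int max_huge_level,
				    const bool *disallow_lpage,
				    bool is_private, int host_level)
{
	assert(max_level >= LVL_4K && max_level <= LVL_1G);

	max_level = min_int(max_level, max_huge_level);
	for ( ; max_level > LVL_4K; max_level--) {
		if (!disallow_lpage[max_level])
			break;
	}

	/* Private memory has no host userspace mapping to consult. */
	if (is_private)
		return max_level;

	if (max_level == LVL_4K)
		return LVL_4K;

	return min_int(host_level, max_level);
}
```

With 1G pages disallowed for the gfn, a 1G request drops to 2M. A shared gfn is then limited further by the host mapping level, while a private gfn is not.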
2023-10-27 11:22:02 -07:00
int kvm_mmu_max_mapping_level(struct kvm *kvm,
			      const struct kvm_memory_slot *slot, gfn_t gfn,
			      int max_level)
{
	bool is_private = kvm_slot_can_be_private(slot) &&
			  kvm_mem_is_private(kvm, gfn);
	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
}
2021-08-07 09:21:53 -04:00
void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
2019-12-06 15:57:25 -08:00
{
2021-09-24 05:05:26 -04:00
	struct kvm_memory_slot *slot = fault->slot;
2020-01-08 12:24:40 -08:00
	kvm_pfn_t mask;
2021-08-07 09:21:53 -04:00
	fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled;
2020-09-23 11:37:31 -07:00
2021-08-07 09:21:53 -04:00
	if (unlikely(fault->max_level == PG_LEVEL_4K))
		return;
2020-01-08 12:24:40 -08:00
2022-04-29 01:04:16 +00:00
	if (is_error_noslot_pfn(fault->pfn))
2021-08-07 09:21:53 -04:00
		return;
2020-01-08 12:24:40 -08:00
2021-09-24 05:05:26 -04:00
	if (kvm_slot_dirty_track_enabled(slot))
2021-08-07 09:21:53 -04:00
		return;
2020-01-08 12:24:46 -08:00
2020-09-23 11:37:31 -07:00
	/*
	 * Enforce the iTLB multihit workaround after capturing the requested
	 * level, which will be used to do precise, accurate accounting.
	 */
2023-10-27 11:22:02 -07:00
	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
						       fault->gfn, fault->max_level,
						       fault->is_private);
2021-08-07 09:21:53 -04:00
	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
		return;
2019-12-06 15:57:25 -08:00
	/*
2022-08-16 20:53:22 +08:00
	 * mmu_invalidate_retry() was successful and mmu_lock is held, so
2020-01-08 12:24:40 -08:00
	 * the pmd can't be split from under us.
2019-12-06 15:57:25 -08:00
	 */
2021-08-07 09:21:53 -04:00
	fault->goal_level = fault->req_level;
	mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1;
	VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask));
	fault->pfn &= ~mask;
2019-12-06 15:57:25 -08:00
}
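The mask arithmetic at the end of kvm_mmu_hugepage_adjust() can be checked with a small worked example. The sketch assumes 4K base pages and 9 address bits per paging level, so a level-N mapping covers 1 << ((N - 1) * 9) base pages; pages_per_hpage() and align_pfn_for_level() are hypothetical helpers, not KVM functions.

```c
#include <assert.h>
#include <stdint.h>

/* Base pages covered by one mapping at "level" (1 = 4K, 2 = 2M, 3 = 1G). */
static uint64_t pages_per_hpage(int level)
{
	return 1ULL << ((level - 1) * 9);
}

/*
 * Round pfn down to the first page of the huge page that contains it.
 * The gfn and pfn must share the same offset within the huge page,
 * otherwise a single huge mapping could not translate one to the other.
 */
static uint64_t align_pfn_for_level(uint64_t gfn, uint64_t pfn, int level)
{
	uint64_t mask = pages_per_hpage(level) - 1;

	assert((gfn & mask) == (pfn & mask));
	return pfn & ~mask;
}
```

For a 2M mapping the mask is 0x1ff, so gfn 0x12345 and pfn 0xabd45 (both at offset 0x145) yield the aligned pfn 0xabc00.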
2021-08-06 04:35:50 -04:00
void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
2019-11-04 12:22:02 +01:00
{
2021-08-06 04:35:50 -04:00
	if (cur_level > PG_LEVEL_4K &&
	    cur_level == fault->goal_level &&
2019-11-04 12:22:02 +01:00
	    is_shadow_present_pte(spte) &&
2022-10-19 16:56:17 +00:00
	    !is_large_pte(spte) &&
	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
2019-11-04 12:22:02 +01:00
		/*
2022-09-21 10:35:46 -07:00
		 * A small SPTE exists for this pfn, but FNAME(fetch),
		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
		 * large PTE instead: just force them to go down another level,
		 * patching back for them into pfn the next 9 bits of the
		 * address.
2019-11-04 12:22:02 +01:00
		 */
2021-08-06 04:35:50 -04:00
		u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
				KVM_PAGES_PER_HPAGE(cur_level - 1);
		fault->pfn |= fault->gfn & page_mask;
		fault->goal_level--;
2019-11-04 12:22:02 +01:00
	}
}
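The "patching back ... the next 9 bits" step can be illustrated with concrete numbers. As before, this is a sketch assuming 4K base pages and 9 bits per level; pages_per_hpage() and step_down_sketch() are hypothetical helpers that model only the arithmetic, not the SPTE checks.

```c
#include <assert.h>
#include <stdint.h>

/* Base pages covered by one mapping at "level" (1 = 4K, 2 = 2M, 3 = 1G). */
static uint64_t pages_per_hpage(int level)
{
	return 1ULL << ((level - 1) * 9);
}

/*
 * Drop the goal level by one and restore, from the gfn, the 9 pfn bits
 * that were cleared when the pfn was aligned for the larger page size.
 */
static void step_down_sketch(uint64_t gfn, uint64_t *pfn, int *goal_level)
{
	int cur_level = *goal_level;
	uint64_t page_mask;

	assert(cur_level > 1);

	/* Selects exactly the index bits consumed at cur_level. */
	page_mask = pages_per_hpage(cur_level) - pages_per_hpage(cur_level - 1);
	*pfn |= gfn & page_mask;
	(*goal_level)--;
}
```

Starting from a 1G-aligned pfn 0x80000 and gfn 0x12345, one step gives pfn 0x92200 at the 2M level (mask 0x3fe00), and a second step gives pfn 0x92345 at the 4K level (mask 0x1ff).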
2022-09-21 10:35:46 -07:00
static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
2008-08-22 19:28:04 +03:00
{
2019-06-24 13:06:21 +02:00
	struct kvm_shadow_walk_iterator it;
2008-08-22 19:28:04 +03:00
	struct kvm_mmu_page *sp;
2021-08-07 09:21:53 -04:00
	int ret;
2021-08-06 04:35:50 -04:00
	gfn_t base_gfn = fault->gfn;
2006-12-10 02:21:36 -08:00
2021-08-07 09:21:53 -04:00
	kvm_mmu_hugepage_adjust(vcpu, fault);
2019-12-06 15:57:26 -08:00
2021-08-06 04:35:50 -04:00
	trace_kvm_mmu_spte_requested(fault);
2021-08-06 04:35:50 -04:00
	for_each_shadow_entry(vcpu, fault->addr, it) {
2019-11-04 12:22:02 +01:00
		/*
		 * We cannot overwrite existing page tables with an NX
		 * large page, as the leaf could be executable.
		 */
2021-08-07 09:21:53 -04:00
		if (fault->nx_huge_page_workaround_enabled)
2021-08-06 04:35:50 -04:00
			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
2019-11-04 12:22:02 +01:00
2022-10-10 20:19:12 +08:00
		base_gfn = gfn_round_for_level(fault->gfn, it.level);
2021-08-07 09:21:53 -04:00
		if (it.level == fault->goal_level)
2008-12-25 14:54:25 +02:00
			break;
2006-12-10 02:21:36 -08:00
2022-06-22 15:26:51 -04:00
		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE. This is done by drop_large_spte().
However, dropping the large SPTE is really only necessary before the
sp is installed. While the sp is returned by kvm_mmu_get_child_sp(),
installing it happens later in __link_shadow_page(). Move the call
there instead of having it in each and every caller.
To ensure that the shadow page is not linked twice if it was present,
do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
instead, return an error value if the shadow page already existed.
This is a bit more verbose, but clearer than NULL.
Finally, now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:07 -04:00
		if (sp == ERR_PTR(-EEXIST))
			continue;
2021-07-02 15:04:50 -07:00
		link_shadow_page(vcpu, it.sptep, sp);
KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs
Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives. Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.
Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs. The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there's no security or performance concerns. However, getting
make_spte() to play nice now and in the future is unnecessarily complex.
In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.
Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs. Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-19 16:56:13 +00:00
		if (fault->huge_page_disallowed)
2022-10-19 16:56:12 +00:00
			account_nx_huge_page(vcpu->kvm, sp,
2022-10-19 16:56:11 +00:00
					     fault->req_level >= it.level);
2008-12-25 14:54:25 +02:00
	}
2019-06-24 13:06:21 +02:00
2021-09-06 20:25:46 +08:00
	if (WARN_ON_ONCE(it.level != fault->goal_level))
		return -EFAULT;
2021-08-13 20:35:03 +00:00
	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
2021-08-17 07:49:47 -04:00
			   base_gfn, fault->pfn, fault);
2020-09-23 15:04:25 -07:00
	if (ret == RET_PF_SPURIOUS)
		return ret;
2019-06-24 13:06:21 +02:00
	direct_pte_prefetch(vcpu, it.sptep);
	return ret;
2006-12-10 02:21:36 -08:00
}
2022-09-21 10:35:41 -07:00
static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
2010-05-31 14:28:19 +08:00
{
2022-09-21 10:35:41 -07:00
	unsigned long hva = gfn_to_hva_memslot(slot, gfn);
	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
2010-05-31 14:28:19 +08:00
}
2022-09-21 10:35:41 -07:00
static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
2010-05-31 14:28:19 +08:00
{
2022-09-21 10:35:41 -07:00
	if (is_sigpending_pfn(fault->pfn)) {
2022-10-11 15:59:47 -04:00
		kvm_handle_signal_exit(vcpu);
		return -EINTR;
	}
2012-08-21 11:02:51 +08:00
/*
* Do not cache the mmio info caused by writing the readonly gfn
* into the spte , otherwise a read access on the readonly gfn can
* also cause an mmio page fault and be treated as an mmio access .
*/
2022-09-21 10:35:41 -07:00
if ( fault - > pfn = = KVM_PFN_ERR_RO_FAULT )
2017-08-17 15:03:32 +02:00
return RET_PF_EMULATE ;
2012-08-21 11:02:51 +08:00
2022-09-21 10:35:41 -07:00
if ( fault - > pfn = = KVM_PFN_ERR_HWPOISON ) {
kvm_send_hwpoison_signal ( fault - > slot , fault - > gfn ) ;
2017-08-17 15:03:32 +02:00
return RET_PF_RETRY ;
2011-07-12 03:29:38 +08:00
}
2010-07-07 20:16:45 +03:00
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
The bug that led to commit 95e057e25892eaa48cad1e2d637b80d0f1a4fac5
was a benign warning (no adverse effects other than the warning
itself) that was detected by syzkaller. Further inspection shows
that the WARN_ON in question, in handle_ept_misconfig(), is
unnecessary and flawed (this was also briefly discussed in the
original patch: https://patchwork.kernel.org/patch/10204649).
* The WARN_ON is unnecessary as kvm_mmu_page_fault() will WARN
if reserved bits are set in the SPTEs, i.e. it covers the case
where an EPT misconfig occurred because of a KVM bug.
* The WARN_ON is flawed because it will fire on any system error
code that is hit while handling the fault, e.g. -ENOMEM can be
returned by mmu_topup_memory_caches() while handling a legitimate
MMIO EPT misconfig.
The original behavior of returning -EFAULT when userspace munmaps
an HVA without first removing the memslot is correct and desirable,
i.e. KVM is letting userspace know it has generated a bad address.
Returning RET_PF_EMULATE masks the WARN_ON in the EPT misconfig path,
but does not fix the underlying bug, i.e. the WARN_ON is bogus.
Furthermore, returning RET_PF_EMULATE has the unwanted side effect of
causing KVM to attempt to emulate an instruction on any page fault
with an invalid HVA translation, e.g. a not-present EPT violation
on a VM_PFNMAP VMA whose fault handler failed to insert a PFN.
* There is no guarantee that the fault is directly related to the
instruction, i.e. the fault could have been triggered by a side
effect memory access in the guest, e.g. while vectoring a #DB or
writing a tracing record. This could cause KVM to effectively
mask the fault if KVM doesn't model the behavior leading to the
fault, i.e. emulation could succeed and resume the guest.
* If emulation does fail, KVM will return EMULATION_FAILED instead
of -EFAULT, which is a red herring as the user will either debug
a bogus emulation attempt or scratch their head wondering why we
were attempting emulation in the first place.
TL;DR: revert to returning -EFAULT and remove the bogus WARN_ON in
handle_ept_misconfig in a future patch.
This reverts commit 95e057e25892eaa48cad1e2d637b80d0f1a4fac5.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-29 14:48:30 -07:00
return - EFAULT ;
2010-05-31 14:28:19 +08:00
}
2022-09-21 10:35:42 -07:00
static int kvm_handle_noslot_fault ( struct kvm_vcpu * vcpu ,
struct kvm_page_fault * fault ,
unsigned int access )
2011-07-12 03:29:38 +08:00
{
2022-09-21 10:35:42 -07:00
gva_t gva = fault - > is_tdp ? 0 : fault - > addr ;
2011-07-12 03:29:38 +08:00
2022-09-21 10:35:42 -07:00
vcpu_cache_mmio_info ( vcpu , gva , fault - > gfn ,
access & shadow_mmio_access_mask ) ;
2021-08-06 04:35:50 -04:00
2022-09-21 10:35:42 -07:00
/*
* If MMIO caching is disabled , emulate immediately without
* touching the shadow page tables as attempting to install an
* MMIO SPTE will just be an expensive nop .
*/
if ( unlikely ( ! enable_mmio_caching ) )
return RET_PF_EMULATE ;
/*
* Do not create an MMIO SPTE for a gfn greater than host . MAXPHYADDR ,
* any guest that generates such gfns is running nested and is being
* tricked by L0 userspace ( you can observe gfn > L1 . MAXPHYADDR if and
* only if L1 ' s MAXPHYADDR is inaccurate with respect to the
* hardware ' s ) .
*/
if ( unlikely ( fault - > gfn > kvm_mmu_max_gfn ( ) ) )
return RET_PF_EMULATE ;
2011-07-12 03:29:38 +08:00
2022-04-23 03:47:46 +00:00
return RET_PF_CONTINUE ;
2011-07-12 03:29:38 +08:00
}
2021-08-06 04:35:50 -04:00
static bool page_fault_can_be_fast ( struct kvm_page_fault * fault )
2012-06-20 15:59:18 +08:00
{
2013-07-18 12:52:37 +08:00
/*
2022-04-23 03:47:45 +00:00
* Page faults with reserved bits set , i . e . faults on MMIO SPTEs , only
* reach the common page fault handler if the SPTE has an invalid MMIO
* generation number . Refreshing the MMIO generation needs to go down
* the slow path . Note , EPT Misconfigs do NOT set the PRESENT flag !
2013-07-18 12:52:37 +08:00
*/
2021-08-06 04:35:50 -04:00
if ( fault - > rsvd )
2013-07-18 12:52:37 +08:00
return false ;
2012-06-20 15:59:18 +08:00
/*
2016-12-06 16:46:16 -08:00
* # PF can be fast if :
*
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault via the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 03:47:44 +00:00
* 1. The shadow page table entry is not present and A / D bits are
* disabled _by KVM_ , which could mean that the fault is potentially
* caused by access tracking ( if enabled ) . If A / D bits are enabled
* by KVM , but disabled by L1 for L2 , KVM is forced to disable A / D
* bits for L2 and employ access tracking , but the fast page fault
* mechanism only supports direct MMUs .
* 2. The shadow page table entry is present , the access is a write ,
* and no reserved bits are set ( MMIO SPTEs cannot be " fixed " ) , i . e .
* the fault was caused by a write - protection violation . If the
* SPTE is MMU - writable ( determined later ) , the fault can be fixed
* by setting the Writable bit , which can be done out of mmu_lock .
2012-06-20 15:59:18 +08:00
*/
2022-04-23 03:47:45 +00:00
if ( ! fault - > present )
return ! kvm_ad_enabled ( ) ;
/*
* Note , instruction fetches and writes are mutually exclusive , ignore
* the " exec " flag .
*/
return fault - > write ;
2012-06-20 15:59:18 +08:00
}
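The eligibility rules in the comment above can be modelled outside the kernel. The following is a minimal sketch, not the kernel code: `struct fault_model`, `can_be_fast()` and the `ad_enabled` parameter are hypothetical stand-ins for `struct kvm_page_fault`, `page_fault_can_be_fast()` and `kvm_ad_enabled()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kvm_page_fault fields consulted here. */
struct fault_model {
	bool rsvd;    /* reserved bits set, i.e. a fault on an MMIO SPTE */
	bool present; /* the SPTE was present when the fault was taken */
	bool write;   /* the faulting access was a write */
};

/* ad_enabled models kvm_ad_enabled(); it is a parameter only for illustration. */
static bool can_be_fast(const struct fault_model *f, bool ad_enabled)
{
	if (f->rsvd)
		return false;       /* MMIO generation refresh needs the slow path */
	if (!f->present)
		return !ad_enabled; /* may be an access-tracking fault */
	return f->write;        /* present: only write-protection faults qualify */
}

static void can_be_fast_examples(void)
{
	struct fault_model wp = { .rsvd = false, .present = true, .write = true };
	struct fault_model rd = { .rsvd = false, .present = true, .write = false };

	assert(can_be_fast(&wp, true));  /* write-protection fault */
	assert(!can_be_fast(&rd, true)); /* read of a present SPTE */
}
```

In this model a not-present fault is only a fast-path candidate when A/D bits are disabled, which mirrors the point of the commit quoted above.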
2016-12-06 16:46:12 -08:00
/*
* Returns true if the SPTE was fixed successfully . Otherwise ,
* someone else modified the SPTE from its original value .
*/
2023-02-02 18:27:51 +00:00
static bool fast_pf_fix_direct_spte ( struct kvm_vcpu * vcpu ,
struct kvm_page_fault * fault ,
u64 * sptep , u64 old_spte , u64 new_spte )
2012-06-20 15:59:18 +08:00
{
2015-01-28 10:54:25 +08:00
/*
* Theoretically we could also set dirty bit ( and flush TLB ) here in
* order to eliminate unnecessary PML logging . See comments in
* set_spte . But fast_page_fault is very unlikely to happen with PML
* enabled , so we do not do this . This might result in the same GPA
* being logged in the PML buffer again when the write really happens ,
* and in mark_page_dirty being called twice for it , but that is
* harmless . This also avoids the TLB flush needed after setting the
* dirty bit , so non - PML cases won ' t be impacted .
*
* Compare with set_spte where instead shadow_dirty_mask is set .
*/
2022-05-20 16:46:35 +02:00
if ( ! try_cmpxchg64 ( sptep , & old_spte , new_spte ) )
2016-12-06 16:46:12 -08:00
return false ;
2021-09-24 05:05:26 -04:00
if ( is_writable_pte ( new_spte ) & & ! is_writable_pte ( old_spte ) )
mark_page_dirty_in_slot ( vcpu - > kvm , fault - > slot , fault - > gfn ) ;
2012-06-20 15:59:18 +08:00
return true ;
}
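The "fix only if nobody else modified the SPTE" contract is a compare-and-exchange. Below is a minimal userspace sketch using C11 atomics rather than the kernel's `try_cmpxchg64()`; `fix_spte()` and `MODEL_WRITABLE` are illustrative names, and the bit position is an assumption made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative writable bit; not taken from KVM's mask definitions. */
#define MODEL_WRITABLE (UINT64_C(1) << 1)

/*
 * Returns true only if *sptep still held old_spte and was replaced by
 * new_spte. On failure the SPTE is left untouched.
 */
static bool fix_spte(_Atomic uint64_t *sptep, uint64_t old_spte, uint64_t new_spte)
{
	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}

static void fix_spte_examples(void)
{
	_Atomic uint64_t spte;
	uint64_t snapshot;

	atomic_init(&spte, UINT64_C(0x1001));
	snapshot = atomic_load(&spte);

	/* Nobody raced with us: the update is applied. */
	assert(fix_spte(&spte, snapshot, snapshot | MODEL_WRITABLE));
	/* The snapshot is now stale, so a second attempt is rejected. */
	assert(!fix_spte(&spte, snapshot, snapshot | MODEL_WRITABLE));
	assert(atomic_load(&spte) == (snapshot | MODEL_WRITABLE));
}
```

A caller that sees `false` would re-read the SPTE and decide again, which is what the retry loop in fast_page_fault() does.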
2021-08-06 04:35:50 -04:00
static bool is_access_allowed ( struct kvm_page_fault * fault , u64 spte )
2016-12-21 20:29:32 -08:00
{
2021-08-06 04:35:50 -04:00
if ( fault - > exec )
2016-12-21 20:29:32 -08:00
return is_executable_pte ( spte ) ;
2021-08-06 04:35:50 -04:00
if ( fault - > write )
2016-12-21 20:29:32 -08:00
return is_writable_pte ( spte ) ;
/* Fault was on Read access */
return spte & PT_PRESENT_MASK ;
}
2021-07-13 22:09:55 +00:00
/*
* Returns the last level spte pointer of the shadow page walk for the given
* gpa , and sets * spte to the spte value . This spte may be non - present . If no
* walk could be performed , returns NULL and * spte does not contain valid data .
*
* Contract :
* - Must be called between walk_shadow_page_lockless_ { begin , end } .
* - The returned sptep must not be used after walk_shadow_page_lockless_end .
*/
static u64 * fast_pf_get_last_sptep ( struct kvm_vcpu * vcpu , gpa_t gpa , u64 * spte )
{
struct kvm_shadow_walk_iterator iterator ;
u64 old_spte ;
u64 * sptep = NULL ;
for_each_shadow_entry_lockless ( vcpu , gpa , iterator , old_spte ) {
sptep = iterator . sptep ;
* spte = old_spte ;
}
return sptep ;
}
2012-06-20 15:59:18 +08:00
/*
2020-09-23 15:04:24 -07:00
* Returns one of RET_PF_INVALID , RET_PF_FIXED or RET_PF_SPURIOUS .
2012-06-20 15:59:18 +08:00
*/
2021-08-06 04:35:50 -04:00
static int fast_page_fault ( struct kvm_vcpu * vcpu , struct kvm_page_fault * fault )
2012-06-20 15:59:18 +08:00
{
2014-04-17 17:06:13 +08:00
struct kvm_mmu_page * sp ;
2020-09-23 15:04:24 -07:00
int ret = RET_PF_INVALID ;
2023-09-06 02:20:06 +08:00
u64 spte ;
u64 * sptep ;
2016-12-06 16:46:12 -08:00
uint retry_count = 0 ;
2012-06-20 15:59:18 +08:00
2021-08-06 04:35:50 -04:00
if ( ! page_fault_can_be_fast ( fault ) )
2020-09-23 15:04:24 -07:00
return ret ;
2012-06-20 15:59:18 +08:00
walk_shadow_page_lockless_begin ( vcpu ) ;
2016-12-06 16:46:12 -08:00
do {
2016-12-21 20:29:32 -08:00
u64 new_spte ;
2012-06-20 15:59:18 +08:00
2022-10-12 18:16:58 +00:00
if ( tdp_mmu_enabled )
2021-08-06 04:35:50 -04:00
sptep = kvm_tdp_mmu_fast_pf_get_last_sptep ( vcpu , fault - > addr , & spte ) ;
2021-07-13 22:09:55 +00:00
else
2021-08-06 04:35:50 -04:00
sptep = fast_pf_get_last_sptep ( vcpu , fault - > addr , & spte ) ;
2016-12-21 20:29:30 -08:00
2023-09-06 02:20:06 +08:00
/*
* It ' s entirely possible for the mapping to have been zapped
* by a different task , but the root page should always be
* available as the vCPU holds a reference to its root ( s ) .
*/
if ( WARN_ON_ONCE ( ! sptep ) )
spte = REMOVED_SPTE ;
2021-02-25 12:47:28 -08:00
if ( ! is_shadow_present_pte ( spte ) )
break ;
2021-07-13 22:09:55 +00:00
sp = sptep_to_sp ( sptep ) ;
2016-12-06 16:46:12 -08:00
if ( ! is_last_spte ( spte , sp - > role . level ) )
break ;
2012-06-20 15:59:18 +08:00
2016-12-06 16:46:12 -08:00
/*
2016-12-06 16:46:16 -08:00
* Check whether the memory access that caused the fault would
* still cause it if it were to be performed right now . If not ,
* then this is a spurious fault caused by a lazily flushed TLB ,
* or some other CPU has already fixed the PTE after the
* current CPU took the fault .
2016-12-06 16:46:12 -08:00
*
* Need not check the access of upper level table entries since
* they are always ACC_ALL .
*/
2021-08-06 04:35:50 -04:00
if ( is_access_allowed ( fault , spte ) ) {
2020-09-23 15:04:24 -07:00
ret = RET_PF_SPURIOUS ;
2016-12-21 20:29:32 -08:00
break ;
}
2016-12-06 16:46:16 -08:00
2016-12-21 20:29:32 -08:00
new_spte = spte ;
2022-04-23 03:47:44 +00:00
/*
* KVM only supports fixing page faults outside of MMU lock for
* direct MMUs , nested MMUs are always indirect , and KVM always
* uses A / D bits for non - nested MMUs . Thus , if A / D bits are
* enabled , the SPTE can ' t be an access - tracked SPTE .
*/
if ( unlikely ( ! kvm_ad_enabled ( ) ) & & is_access_track_spte ( spte ) )
2016-12-21 20:29:32 -08:00
new_spte = restore_acc_track_spte ( new_spte ) ;
/*
2022-04-23 03:47:44 +00:00
* To keep things simple , only SPTEs that are MMU - writable can
* be made fully writable outside of mmu_lock , e . g . only SPTEs
* that were write - protected for dirty - logging or access
* tracking are handled here . Don ' t bother checking if the
* SPTE is writable to prioritize running with A / D bits enabled .
* The is_access_allowed ( ) check above handles the common case
* of the fault being spurious , and the SPTE is known to be
* shadow - present , i . e . except for access tracking restoration
* making the new SPTE writable , the check is wasteful .
2016-12-21 20:29:32 -08:00
*/
2022-04-23 03:47:41 +00:00
if ( fault - > write & & is_mmu_writable_spte ( spte ) ) {
2016-12-21 20:29:32 -08:00
new_spte | = PT_WRITABLE_MASK ;
2016-12-06 16:46:16 -08:00
/*
2021-11-03 17:33:59 -07:00
* Do not fix write - permission on the large spte when
* dirty logging is enabled . Since we only mark the
* first page in the dirty - bitmap in
2016-12-21 20:29:32 -08:00
* fast_pf_fix_direct_spte ( ) , other pages are missed
* if its slot has dirty logging enabled .
*
* Instead , we let the slow page fault path create a
* normal spte to fix the access .
2016-12-06 16:46:16 -08:00
*/
2021-11-03 17:33:59 -07:00
if ( sp - > role . level > PG_LEVEL_4K & &
kvm_slot_dirty_track_enabled ( fault - > slot ) )
2016-12-06 16:46:16 -08:00
break ;
2016-12-06 16:46:12 -08:00
}
2012-06-20 15:59:18 +08:00
2016-12-06 16:46:16 -08:00
/* Verify that the fault can be handled in the fast path */
2016-12-21 20:29:32 -08:00
if ( new_spte = = spte | |
2021-08-06 04:35:50 -04:00
! is_access_allowed ( fault , new_spte ) )
2016-12-06 16:46:12 -08:00
break ;
/*
* Currently , fast page fault only works for direct mapping
* since the gfn is not stable for indirect shadow page . See
2020-04-14 18:48:36 +02:00
* Documentation / virt / kvm / locking . rst to get more detail .
2016-12-06 16:46:12 -08:00
*/
2021-09-24 05:05:26 -04:00
if ( fast_pf_fix_direct_spte ( vcpu , fault , sptep , spte , new_spte ) ) {
2020-09-23 15:04:24 -07:00
ret = RET_PF_FIXED ;
2016-12-06 16:46:12 -08:00
break ;
2020-09-23 15:04:24 -07:00
}
2016-12-06 16:46:12 -08:00
if ( + + retry_count > 4 ) {
KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:18 +00:00
pr_warn_once ( " Fast #PF retrying more than 4 times. \n " ) ;
2016-12-06 16:46:12 -08:00
break ;
}
} while ( true ) ;
2014-04-17 17:06:14 +08:00
2021-08-06 04:35:50 -04:00
trace_fast_page_fault ( vcpu , fault , sptep , spte , ret ) ;
2012-06-20 15:59:18 +08:00
walk_shadow_page_lockless_end ( vcpu ) ;
2022-04-23 03:47:49 +00:00
if ( ret ! = RET_PF_INVALID )
vcpu - > stat . pf_fast + + ;
2020-09-23 15:04:24 -07:00
return ret ;
2012-06-20 15:59:18 +08:00
}
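The three possible outcomes of one pass through the loop above can be illustrated with a single-threaded model. This is a simplified sketch under stated assumptions: it ignores access tracking, huge pages with dirty logging, and the atomic update, and the bit positions (`M_PRESENT`, `M_WRITABLE`, `M_MMU_WRITABLE`) and `fast_fix_once()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum model_ret { MODEL_INVALID, MODEL_FIXED, MODEL_SPURIOUS };

/* Hypothetical bit layout, chosen only for this example. */
#define M_PRESENT      (UINT64_C(1) << 0)
#define M_WRITABLE     (UINT64_C(1) << 1)
#define M_MMU_WRITABLE (UINT64_C(1) << 58)

/* One non-atomic pass of the decision logic for a read or write fault. */
static enum model_ret fast_fix_once(uint64_t *spte, bool write)
{
	uint64_t old = *spte;
	uint64_t new_spte = old;

	if (!(old & M_PRESENT))
		return MODEL_INVALID;   /* nothing to fix, take the slow path */

	/* The access would succeed now: the fault was already resolved. */
	if (write ? (old & M_WRITABLE) != 0 : (old & M_PRESENT) != 0)
		return MODEL_SPURIOUS;

	if (write && (old & M_MMU_WRITABLE))
		new_spte |= M_WRITABLE;

	if (new_spte == old)
		return MODEL_INVALID;   /* not fixable without mmu_lock */

	*spte = new_spte;
	return MODEL_FIXED;
}

static void fast_fix_once_examples(void)
{
	uint64_t spte = M_PRESENT | M_MMU_WRITABLE;

	assert(fast_fix_once(&spte, true) == MODEL_FIXED);
	/* A second vCPU faulting on the same page now sees a spurious fault. */
	assert(fast_fix_once(&spte, true) == MODEL_SPURIOUS);
}
```

In the real function the store is the compare-and-exchange in fast_pf_fix_direct_spte(), and a lost race triggers another iteration instead of an immediate return.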
2018-05-04 11:37:11 -07:00
static void mmu_free_root_page ( struct kvm * kvm , hpa_t * root_hpa ,
struct list_head * invalid_list )
2007-01-05 16:36:40 -08:00
{
2007-11-21 15:28:32 +02:00
struct kvm_mmu_page * sp ;
2007-01-05 16:36:40 -08:00
2018-05-04 11:37:11 -07:00
if ( ! VALID_PAGE ( * root_hpa ) )
2007-06-05 12:17:03 +03:00
return ;
2013-05-16 11:55:51 +03:00
2023-07-28 17:51:56 -07:00
sp = root_to_sp ( * root_hpa ) ;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
if ( WARN_ON_ONCE ( ! sp ) )
2022-02-08 19:08:33 -05:00
return ;
2020-10-14 20:26:44 +02:00
2024-01-10 18:00:48 -08:00
if ( is_tdp_mmu_page ( sp ) ) {
lockdep_assert_held_read ( & kvm - > mmu_lock ) ;
2023-11-25 03:33:57 -05:00
kvm_tdp_mmu_put_root ( kvm , sp ) ;
2024-01-10 18:00:48 -08:00
} else {
lockdep_assert_held_write ( & kvm - > mmu_lock ) ;
if ( ! - - sp - > root_count & & sp - > role . invalid )
kvm_mmu_prepare_zap_page ( kvm , sp , invalid_list ) ;
}
2007-01-05 16:36:40 -08:00
2018-05-04 11:37:11 -07:00
* root_hpa = INVALID_PAGE ;
}
2018-06-27 14:59:17 -07:00
/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
2022-02-21 09:31:51 -05:00
void kvm_mmu_free_roots ( struct kvm * kvm , struct kvm_mmu * mmu ,
2018-10-08 21:28:07 +02:00
ulong roots_to_free )
2018-05-04 11:37:11 -07:00
{
2024-01-10 18:00:48 -08:00
bool is_tdp_mmu = tdp_mmu_enabled & & mmu - > root_role . direct ;
2018-05-04 11:37:11 -07:00
int i ;
LIST_HEAD ( invalid_list ) ;
2022-02-08 17:53:55 -05:00
bool free_active_root ;
2018-05-04 11:37:11 -07:00
2023-02-16 23:41:13 +08:00
WARN_ON_ONCE ( roots_to_free & ~ KVM_MMU_ROOTS_ALL ) ;
2018-06-27 14:59:20 -07:00
BUILD_BUG_ON ( KVM_MMU_NUM_PREV_ROOTS > = BITS_PER_LONG ) ;
2018-05-04 11:37:11 -07:00
2018-06-27 14:59:17 -07:00
/* Before acquiring the MMU lock, see if we need to do any real work. */
2022-02-08 17:53:55 -05:00
free_active_root = ( roots_to_free & KVM_MMU_ROOT_CURRENT )
& & VALID_PAGE ( mmu - > root . hpa ) ;
if ( ! free_active_root ) {
2018-06-27 14:59:20 -07:00
for ( i = 0 ; i < KVM_MMU_NUM_PREV_ROOTS ; i + + )
if ( ( roots_to_free & KVM_MMU_ROOT_PREVIOUS ( i ) ) & &
VALID_PAGE ( mmu - > prev_roots [ i ] . hpa ) )
break ;
if ( i = = KVM_MMU_NUM_PREV_ROOTS )
return ;
}
2013-05-16 11:55:51 +03:00
2024-01-10 18:00:48 -08:00
if ( is_tdp_mmu )
read_lock ( & kvm - > mmu_lock ) ;
else
write_lock ( & kvm - > mmu_lock ) ;
2007-01-05 16:36:40 -08:00
2018-06-27 14:59:20 -07:00
for ( i = 0 ; i < KVM_MMU_NUM_PREV_ROOTS ; i + + )
if ( roots_to_free & KVM_MMU_ROOT_PREVIOUS ( i ) )
2020-09-23 12:12:04 -07:00
mmu_free_root_page ( kvm , & mmu - > prev_roots [ i ] . hpa ,
2018-06-27 14:59:20 -07:00
& invalid_list ) ;
2018-06-27 14:59:06 -07:00
2018-06-27 14:59:17 -07:00
if ( free_active_root ) {
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).
Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.
Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).
If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:52:00 -07:00
if ( kvm_mmu_is_dummy_root ( mmu - > root . hpa ) ) {
/* Nothing to cleanup for dummy roots. */
} else if ( root_to_sp ( mmu - > root . hpa ) ) {
2022-02-21 09:28:33 -05:00
mmu_free_root_page ( kvm , & mmu - > root . hpa , & invalid_list ) ;
2021-03-04 17:10:46 -08:00
} else if ( mmu - > pae_root ) {
2021-03-09 14:42:06 -08:00
for ( i = 0 ; i < 4 ; + + i ) {
if ( ! IS_VALID_PAE_ROOT ( mmu - > pae_root [ i ] ) )
continue ;
mmu_free_root_page ( kvm , & mmu - > pae_root [ i ] ,
& invalid_list ) ;
mmu - > pae_root [ i ] = INVALID_PAE_ROOT ;
}
2018-06-27 14:59:17 -07:00
}
2022-02-21 09:28:33 -05:00
mmu - > root . hpa = INVALID_PAGE ;
mmu - > root . pgd = 0 ;
2007-01-05 16:36:40 -08:00
}
2018-05-04 11:37:11 -07:00
2024-01-10 18:00:48 -08:00
if ( is_tdp_mmu ) {
read_unlock ( & kvm - > mmu_lock ) ;
WARN_ON_ONCE ( ! list_empty ( & invalid_list ) ) ;
} else {
kvm_mmu_commit_zap_page ( kvm , & invalid_list ) ;
write_unlock ( & kvm - > mmu_lock ) ;
}
2007-01-05 16:36:40 -08:00
}
2018-05-04 11:37:11 -07:00
EXPORT_SYMBOL_GPL ( kvm_mmu_free_roots ) ;
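The check performed before taking mmu_lock in kvm_mmu_free_roots() boils down to "is any requested root actually valid?". A minimal sketch follows; the `MODEL_*` macros, the bit layout (current root in bit 0, previous root i in bit 1 + i), the count of three previous roots, and `struct model_mmu` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for this sketch only. */
#define MODEL_NUM_PREV_ROOTS   3
#define MODEL_ROOT_CURRENT     (1UL << 0)
#define MODEL_ROOT_PREVIOUS(i) (1UL << (1 + (i)))

/* Hypothetical stand-in: tracks only whether each root is valid. */
struct model_mmu {
	bool root_valid;
	bool prev_valid[MODEL_NUM_PREV_ROOTS];
};

/* Returns true if at least one requested root is valid, i.e. the lock is needed. */
static bool needs_lock(const struct model_mmu *mmu, unsigned long roots_to_free)
{
	int i;

	if ((roots_to_free & MODEL_ROOT_CURRENT) && mmu->root_valid)
		return true;

	for (i = 0; i < MODEL_NUM_PREV_ROOTS; i++)
		if ((roots_to_free & MODEL_ROOT_PREVIOUS(i)) && mmu->prev_valid[i])
			return true;

	return false;
}

static void needs_lock_examples(void)
{
	struct model_mmu mmu = { .root_valid = false, .prev_valid = { false, true, false } };

	assert(needs_lock(&mmu, MODEL_ROOT_PREVIOUS(1)));
	/* Requested roots are all invalid: return without touching the lock. */
	assert(!needs_lock(&mmu, MODEL_ROOT_CURRENT | MODEL_ROOT_PREVIOUS(0)));
}
```

Skipping the lock in the common "nothing to free" case avoids contending on mmu_lock for a no-op.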
2007-01-05 16:36:40 -08:00
2022-02-21 09:31:51 -05:00
void kvm_mmu_free_guest_mode_roots ( struct kvm * kvm , struct kvm_mmu * mmu )
2021-06-09 16:42:29 -07:00
{
unsigned long roots_to_free = 0 ;
2023-07-28 17:51:56 -07:00
struct kvm_mmu_page * sp ;
2021-06-09 16:42:29 -07:00
hpa_t root_hpa ;
int i ;
/*
* This should not be called while L2 is active , L2 can ' t invalidate
* _only_ its own roots , e . g . INVVPID unconditionally exits .
*/
2022-02-14 08:46:24 -05:00
WARN_ON_ONCE ( mmu - > root_role . guest_mode ) ;
2021-06-09 16:42:29 -07:00
for ( i = 0 ; i < KVM_MMU_NUM_PREV_ROOTS ; i + + ) {
root_hpa = mmu - > prev_roots [ i ] . hpa ;
if ( ! VALID_PAGE ( root_hpa ) )
continue ;
2023-07-28 17:51:56 -07:00
sp = root_to_sp ( root_hpa ) ;
if ( ! sp | | sp - > role . guest_mode )
2021-06-09 16:42:29 -07:00
roots_to_free | = KVM_MMU_ROOT_PREVIOUS ( i ) ;
}
2022-02-21 09:31:51 -05:00
kvm_mmu_free_roots ( kvm , mmu , roots_to_free ) ;
2021-06-09 16:42:29 -07:00
}
EXPORT_SYMBOL_GPL ( kvm_mmu_free_guest_mode_roots ) ;
2022-06-22 15:26:51 -04:00
static hpa_t mmu_alloc_root ( struct kvm_vcpu * vcpu , gfn_t gfn , int quadrant ,
2022-06-22 15:26:50 -04:00
u8 level )
2010-09-10 17:30:59 +02:00
{
2022-06-22 15:26:51 -04:00
union kvm_mmu_page_role role = vcpu - > arch . mmu - > root_role ;
2010-09-10 17:30:59 +02:00
struct kvm_mmu_page * sp ;
2020-04-27 19:37:14 -07:00
2022-06-22 15:26:51 -04:00
role . level = level ;
2022-06-22 15:26:52 -04:00
role . quadrant = quadrant ;
2022-06-22 15:26:51 -04:00
2022-06-22 15:26:52 -04:00
WARN_ON_ONCE ( quadrant & & ! role . has_4_byte_gpte ) ;
WARN_ON_ONCE ( role . direct & & role . has_4_byte_gpte ) ;
2022-06-22 15:26:51 -04:00
2022-06-22 15:26:55 -04:00
sp = kvm_mmu_get_shadow_page ( vcpu , gfn , role ) ;
2020-04-27 19:37:14 -07:00
+ + sp - > root_count ;
return __pa ( sp - > spt ) ;
}
static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
2021-03-04 17:10:47 -08:00
	struct kvm_mmu *mmu = vcpu->arch.mmu;
2022-02-10 07:41:19 -05:00
	u8 shadow_root_level = mmu->root_role.level;
2020-04-27 19:37:14 -07:00
	hpa_t root;
2010-10-03 18:51:39 +02:00
	unsigned i;
2021-04-08 08:10:25 -04:00
	int r;
2024-01-10 18:00:46 -08:00
	if (tdp_mmu_enabled)
		return kvm_tdp_mmu_alloc_root(vcpu);
2021-04-08 08:10:25 -04:00
	write_lock(&vcpu->kvm->mmu_lock);
	r = make_mmu_pages_available(vcpu);
	if (r < 0)
		goto out_unlock;
2010-09-10 17:30:59 +02:00
2024-01-10 18:00:46 -08:00
	if (shadow_root_level >= PT64_ROOT_4LEVEL) {
2022-06-22 15:26:50 -04:00
		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = root;
2020-04-27 19:37:14 -07:00
	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
2021-04-08 08:10:25 -04:00
		if (WARN_ON_ONCE(!mmu->pae_root)) {
			r = -EIO;
			goto out_unlock;
		}
2021-03-04 17:11:01 -08:00
2010-09-10 17:30:59 +02:00
		for (i = 0; i < 4; ++i) {
2021-03-09 14:42:06 -08:00
			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
2010-09-10 17:30:59 +02:00
2022-06-22 15:26:52 -04:00
			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
2022-06-22 15:26:51 -04:00
					      PT32_ROOT_LEVEL);
2021-03-04 17:10:54 -08:00
			mmu->pae_root[i] = root | PT_PRESENT_MASK |
2022-06-08 09:20:15 +08:00
					   shadow_me_value;
2010-09-10 17:30:59 +02:00
		}
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = __pa(mmu->pae_root);
2021-03-04 17:11:01 -08:00
	} else {
		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
2021-04-08 08:10:25 -04:00
		r = -EIO;
		goto out_unlock;
2021-03-04 17:11:01 -08:00
	}
2020-02-28 14:52:39 -08:00
2022-02-21 09:28:33 -05:00
	/* root.pgd is ignored for direct MMUs. */
	mmu->root.pgd = 0;
2021-04-08 08:10:25 -04:00
out_unlock:
	write_unlock(&vcpu->kvm->mmu_lock);
	return r;
2010-09-10 17:30:59 +02:00
}
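In the PAE branch of mmu_alloc_direct_roots(), the gfn passed for root i is `i << (30 - PAGE_SHIFT)`, i.e. the first frame of the i-th 1 GiB region. The arithmetic can be checked in isolation with the sketch below, which assumes 4 KiB pages (PAGE_SHIFT_SKETCH = 12); the function names are hypothetical and exist only for this example.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12  /* assumption: 4 KiB base pages */

/* First guest frame number covered by PAE root entry i (each covers 1 GiB). */
uint64_t pae_root_base_gfn(unsigned int i)
{
	return (uint64_t)i << (30 - PAGE_SHIFT_SKETCH);
}

/* Convert a frame number back to a byte address. */
uint64_t gfn_to_addr_sketch(uint64_t gfn)
{
	return gfn << PAGE_SHIFT_SKETCH;
}
```

With these assumptions, entry 1 starts at gfn 0x40000 (address 0x40000000) and entry 3 at address 0xC0000000, so the four entries together span the 4 GiB reachable through a PAE root.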
2021-10-15 12:30:21 -04:00
static int mmu_first_shadow_root_alloc(struct kvm *kvm)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *slot;
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-06 20:54:30 +01:00
	int r = 0, i, bkt;
2021-10-15 12:30:21 -04:00
	/*
	 * Check if this is the first shadow root being allocated before
	 * taking the lock.
	 */
	if (kvm_shadow_root_allocated(kvm))
		return 0;
	mutex_lock(&kvm->slots_arch_lock);
	/* Recheck, under the lock, whether this is the first shadow root. */
	if (kvm_shadow_root_allocated(kvm))
		goto out_unlock;
	/*
	 * Check if anything actually needs to be allocated, e.g. all metadata
	 * will be allocated upfront if TDP is disabled.
	 */
	if (kvm_memslots_have_rmaps(kvm) &&
	    kvm_page_track_write_tracking_enabled(kvm))
		goto out_success;
2023-10-27 11:22:04 -07:00
	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
2021-10-15 12:30:21 -04:00
		slots = __kvm_memslots(kvm, i);
2021-12-06 20:54:30 +01:00
		kvm_for_each_memslot(slot, bkt, slots) {
2021-10-15 12:30:21 -04:00
			/*
			 * Both of these functions are no-ops if the target is
			 * already allocated, so unconditionally calling both
			 * is safe.  Intentionally do NOT free allocations on
			 * failure to avoid having to track which allocations
			 * were made now versus when the memslot was created.
			 * The metadata is guaranteed to be freed when the slot
			 * is freed, and will be kept/used if userspace retries
			 * KVM_RUN instead of killing the VM.
			 */
			r = memslot_rmap_alloc(slot, slot->npages);
			if (r)
				goto out_unlock;
			r = kvm_page_track_write_tracking_alloc(slot);
			if (r)
				goto out_unlock;
		}
	}
	/*
	 * Ensure that shadow_root_allocated becomes true strictly after
	 * all the related pointers are set.
	 */
out_success:
	smp_store_release(&kvm->arch.shadow_root_allocated, true);
out_unlock:
	mutex_unlock(&kvm->slots_arch_lock);
	return r;
}
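mmu_first_shadow_root_alloc() follows a check, lock, recheck, publish pattern: a lockless fast-path check, a second check under the mutex, and a release store that publishes the flag only after the metadata pointers are set. The userspace sketch below mirrors that shape with C11 atomics and a pthread mutex. It is an analogy under stated assumptions, not kernel code: struct vm_sketch, first_use_alloc(), and the alloc_calls counter are hypothetical, and the acquire load stands in for whatever ordering the kernel-side reader uses.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-VM state; not a kernel structure. */
struct vm_sketch {
	pthread_mutex_t lock;
	atomic_bool allocated;  /* published last, with release semantics */
	int *metadata;          /* must be visible before 'allocated' reads true */
	int alloc_calls;        /* counts how often the slow path allocated */
};

int first_use_alloc(struct vm_sketch *vm)
{
	int r = 0;

	/* Fast path: skip the lock once the flag has been published. */
	if (atomic_load_explicit(&vm->allocated, memory_order_acquire))
		return 0;

	pthread_mutex_lock(&vm->lock);

	/* Recheck under the lock; another thread may have won the race. */
	if (atomic_load_explicit(&vm->allocated, memory_order_acquire))
		goto out_unlock;

	vm->metadata = calloc(16, sizeof(*vm->metadata));
	vm->alloc_calls++;
	if (!vm->metadata) {
		r = -1;
		goto out_unlock;
	}

	/* Publish only after the pointer is set. */
	atomic_store_explicit(&vm->allocated, true, memory_order_release);
out_unlock:
	pthread_mutex_unlock(&vm->lock);
	return r;
}
```

Calling first_use_alloc() twice on the same object allocates once; the second call returns from the fast path without touching the mutex.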
2010-09-10 17:30:59 +02:00
static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
2007-01-05 16:36:40 -08:00
{
2021-03-04 17:10:47 -08:00
	struct kvm_mmu *mmu = vcpu->arch.mmu;
2021-03-04 17:10:51 -08:00
	u64 pdptrs[4], pm_mask;
2020-03-20 14:28:32 -07:00
	gfn_t root_gfn, root_pgd;
2022-06-22 15:26:52 -04:00
	int quadrant, i, r;
2020-04-27 19:37:14 -07:00
	hpa_t root;
2007-01-05 16:36:51 -08:00
2023-03-22 02:37:26 +01:00
	root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
2023-09-13 20:42:16 +08:00
	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
2007-01-05 16:36:40 -08:00
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).
Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.
Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).
If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:52:00 -07:00
	if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
		mmu->root.hpa = kvm_mmu_get_dummy_root();
		return 0;
	}
2010-09-10 17:30:59 +02:00
2021-04-08 08:10:25 -04:00
	/*
	 * On SVM, reading PDPTRs might access guest memory, which might fault
	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
	 */
2022-02-10 07:42:22 -05:00
	if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
2021-03-04 17:10:51 -08:00
		for (i = 0; i < 4; ++i) {
			pdptrs[i] = mmu->get_pdptr(vcpu, i);
			if (!(pdptrs[i] & PT_PRESENT_MASK))
				continue;
2023-07-28 17:52:00 -07:00
			if (!kvm_vcpu_is_visible_gfn(vcpu, pdptrs[i] >> PAGE_SHIFT))
				pdptrs[i] = 0;
2021-03-04 17:10:51 -08:00
		}
	}
2021-10-15 12:30:21 -04:00
	r = mmu_first_shadow_root_alloc(vcpu->kvm);
2021-05-18 10:34:14 -07:00
	if (r)
		return r;
2021-04-08 08:10:25 -04:00
	write_lock(&vcpu->kvm->mmu_lock);
	r = make_mmu_pages_available(vcpu);
	if (r < 0)
		goto out_unlock;
2010-09-10 17:30:59 +02:00
	/*
	 * Do we shadow a long mode page table? If so we need to
	 * write-protect the guest's page table root.
	 */
2022-02-10 07:42:22 -05:00
	if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
2020-04-27 19:37:14 -07:00
		root = mmu_alloc_root(vcpu, root_gfn, 0,
2022-06-22 15:26:50 -04:00
				      mmu->root_role.level);
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = root;
2020-03-20 14:28:32 -07:00
		goto set_root_pgd;
2007-01-05 16:36:40 -08:00
	}
2010-09-02 17:29:45 +02:00
2021-04-08 08:10:25 -04:00
	if (WARN_ON_ONCE(!mmu->pae_root)) {
		r = -EIO;
		goto out_unlock;
	}
2021-03-04 17:11:01 -08:00
2010-09-10 17:30:59 +02:00
	/*
	 * We shadow a 32 bit page table. This may be a legacy 2-level
2010-09-10 17:31:00 +02:00
	 * or a PAE 3-level page table. In either case we need to be aware that
	 * the shadow page table may be a PAE or a long mode page table.
2010-09-10 17:30:59 +02:00
	 */
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of the
high physical address bits as 'KeyID' bits. Intel Trust Domain
Extensions (TDX) further steals part of the MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike AMD's SME, the host kernel doesn't set
any legacy MKTME KeyID bits in any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in an SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for an SPTE which maps guest memory. MKTME KeyID bits should be set
in shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask in the SPTE, and shadow_me_mask is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME KeyID bits doesn't work for VMX (on the contrary, they must be set
in shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set to the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a super
set of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's life time. This
perhaps doesn't fit MKTME but for now host kernel doesn't support it
(and perhaps will never do).
- Bits in shadow_me_mask are set to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-19 23:17:03 +12:00
	pm_mask = PT_PRESENT_MASK | shadow_me_value;
2022-02-10 07:41:19 -05:00
	if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
2010-09-10 17:31:00 +02:00
		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
2021-05-05 13:42:21 -07:00
		if (WARN_ON_ONCE(!mmu->pml4_root)) {
2021-04-08 08:10:25 -04:00
			r = -EIO;
			goto out_unlock;
		}
2021-05-05 13:42:21 -07:00
		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
2021-08-18 11:55:48 -05:00
2022-02-10 07:41:19 -05:00
		if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
2021-08-18 11:55:48 -05:00
			if (WARN_ON_ONCE(!mmu->pml5_root)) {
				r = -EIO;
				goto out_unlock;
			}
			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
		}
2021-03-04 17:10:46 -08:00
	}
2007-01-05 16:36:40 -08:00
	for (i = 0; i < 4; ++i) {
2021-03-09 14:42:06 -08:00
		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
2021-03-04 17:10:50 -08:00
2022-02-10 07:42:22 -05:00
		if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
2021-03-04 17:10:51 -08:00
			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
2021-03-09 14:42:06 -08:00
				mmu->pae_root[i] = INVALID_PAE_ROOT;
2007-04-12 17:35:58 +03:00
				continue;
			}
2021-03-04 17:10:51 -08:00
			root_gfn = pdptrs[i] >> PAGE_SHIFT;
2010-04-26 17:00:05 -07:00
		}
2010-05-04 12:58:32 +03:00
2022-06-22 15:26:52 -04:00
		/*
		 * If shadowing 32-bit non-PAE page tables, each PAE page
		 * directory maps one quarter of the guest's non-PAE page
		 * directory. Otherwise each PAE page directory shadows one
		 * guest PAE page directory, so the quadrant should be 0.
		 */
		quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
		root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
2021-03-04 17:10:47 -08:00
		mmu->pae_root[i] = root | pm_mask;
2007-01-05 16:36:40 -08:00
	}
2010-09-10 17:31:00 +02:00
2022-02-10 07:41:19 -05:00
	if (mmu->root_role.level == PT64_ROOT_5LEVEL)
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = __pa(mmu->pml5_root);
2022-02-10 07:41:19 -05:00
	else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = __pa(mmu->pml4_root);
2021-03-04 17:10:48 -08:00
	else
2022-02-21 09:28:33 -05:00
		mmu->root.hpa = __pa(mmu->pae_root);
2010-09-10 17:31:00 +02:00
2020-03-20 14:28:32 -07:00
set_root_pgd:
2022-02-21 09:28:33 -05:00
	mmu->root.pgd = root_pgd;
2021-04-08 08:10:25 -04:00
out_unlock:
	write_unlock(&vcpu->kvm->mmu_lock);
2019-02-22 17:45:01 +01:00
2022-03-01 20:49:41 +08:00
	return r;
2007-01-05 16:36:40 -08:00
}
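The quadrant comment in mmu_alloc_shadow_roots() can be made concrete with a little address arithmetic. A non-PAE guest page directory holds 1024 entries of 4 MiB each, while one PAE page directory covers 1 GiB, so each of the four PAE directories shadows 256 consecutive guest entries. The sketch below works this out for a 32-bit guest virtual address; the function names are hypothetical and the code is an illustration of the arithmetic, not of KVM's internals.

```c
#include <stdint.h>

/* Which of the four 1 GiB quadrants a 32-bit guest address falls in. */
unsigned int quadrant_for_addr(uint32_t gva)
{
	return gva >> 30;
}

/* Index into the guest's non-PAE page directory (4 MiB per entry). */
unsigned int guest_pde_index(uint32_t gva)
{
	return gva >> 22;
}

/* Quadrant that shadows a given guest page directory index. */
unsigned int quadrant_for_pde(unsigned int pde_index)
{
	return pde_index / 256;  /* 1 GiB / 4 MiB = 256 entries per quadrant */
}
```

For example, address 0x80000000 has guest page directory index 512 and lands in quadrant 2 by either route.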
2021-03-04 17:10:49 -08:00
static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
2022-02-10 07:41:19 -05:00
	bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
2021-08-18 11:55:48 -05:00
	u64 *pml5_root = NULL;
	u64 *pml4_root = NULL;
	u64 *pae_root;
2010-09-10 17:31:00 +02:00
	/*
2021-03-04 17:10:49 -08:00
	 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
	 * tables are allocated and initialized at root creation as there is no
	 * equivalent level in the guest's NPT to shadow.  Allocate the tables
	 * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
2010-09-10 17:31:00 +02:00
	 */
2022-02-10 08:00:56 -05:00
	if (mmu->root_role.direct ||
	    mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
2022-02-10 07:41:19 -05:00
	    mmu->root_role.level < PT64_ROOT_4LEVEL)
2021-03-04 17:10:49 -08:00
		return 0;
2010-09-10 17:31:00 +02:00
2021-08-23 17:58:24 -07:00
	/*
	 * NPT, the only paging mode that uses this horror, uses a fixed number
	 * of levels for the shadow page tables, e.g. all MMUs are 4-level or
	 * all MMUs are 5-level.  Thus, this can safely require that pml5_root
	 * is allocated if the other roots are valid and pml5 is needed, as any
	 * prior MMU would also have required pml5.
	 */
	if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
2021-03-04 17:10:49 -08:00
		return 0;
2010-09-10 17:31:00 +02:00
2021-03-04 17:10:49 -08:00
	/*
	 * The special roots should always be allocated in concert.  Yell and
	 * bail if KVM ends up in a state where only one of the roots is valid.
	 */
2021-08-18 11:55:48 -05:00
	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
2021-08-23 17:58:24 -07:00
			 (need_pml5 && mmu->pml5_root)))
2021-03-04 17:10:49 -08:00
		return -EIO;
2010-09-10 17:31:00 +02:00
2021-03-09 14:42:07 -08:00
	/*
	 * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
	 * doesn't need to be decrypted.
	 */
2021-03-04 17:10:49 -08:00
	pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
	if (!pae_root)
		return -ENOMEM;
2010-09-10 17:31:00 +02:00
2021-08-18 11:55:48 -05:00
#ifdef CONFIG_X86_64
2021-05-05 13:42:21 -07:00
	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
2021-08-18 11:55:48 -05:00
	if (!pml4_root)
		goto err_pml4;
2021-08-23 17:58:24 -07:00
	if (need_pml5) {
2021-08-18 11:55:48 -05:00
		pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
		if (!pml5_root)
			goto err_pml5;
2010-09-10 17:31:00 +02:00
	}
2021-08-18 11:55:48 -05:00
#endif
2010-09-10 17:31:00 +02:00
2021-03-04 17:10:49 -08:00
	mmu->pae_root = pae_root;
2021-05-05 13:42:21 -07:00
	mmu->pml4_root = pml4_root;
2021-08-18 11:55:48 -05:00
	mmu->pml5_root = pml5_root;
2019-02-22 17:45:01 +01:00
2009-05-12 18:55:45 -03:00
	return 0;
2021-08-18 11:55:48 -05:00
#ifdef CONFIG_X86_64
err_pml5:
	free_page((unsigned long)pml4_root);
err_pml4:
	free_page((unsigned long)pae_root);
	return -ENOMEM;
#endif
2007-01-05 16:36:40 -08:00
}
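mmu_alloc_special_roots() allocates up to three pages and unwinds with stacked goto labels, committing the pointers only once every allocation has succeeded. The same shape in portable C is sketched below. It uses calloc()/free() in place of the kernel page allocator, and struct special_roots_sketch and alloc_special_roots_sketch() are hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SIZE_SKETCH 4096  /* assumption: 4 KiB pages */

/* Hypothetical container for the three optional root tables. */
struct special_roots_sketch {
	void *pae_root;
	void *pml4_root;
	void *pml5_root;
};

int alloc_special_roots_sketch(struct special_roots_sketch *roots, bool need_pml5)
{
	void *pae_root, *pml4_root, *pml5_root = NULL;

	pae_root = calloc(1, PAGE_SIZE_SKETCH);
	if (!pae_root)
		return -1;

	pml4_root = calloc(1, PAGE_SIZE_SKETCH);
	if (!pml4_root)
		goto err_pml4;

	if (need_pml5) {
		pml5_root = calloc(1, PAGE_SIZE_SKETCH);
		if (!pml5_root)
			goto err_pml5;
	}

	/* Commit only after every allocation succeeded. */
	roots->pae_root = pae_root;
	roots->pml4_root = pml4_root;
	roots->pml5_root = pml5_root;
	return 0;

err_pml5:
	free(pml4_root);
err_pml4:
	free(pae_root);
	return -1;
}
```

Because the labels are ordered in reverse allocation order, a failure at any step frees exactly what was allocated before it and leaves the caller's structure untouched.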
2021-10-19 19:01:53 +08:00
static bool is_unsync_root(hpa_t root)
{
	struct kvm_mmu_page *sp;
2023-07-28 17:52:00 -07:00
	if (!VALID_PAGE(root) || kvm_mmu_is_dummy_root(root))
2021-10-19 19:01:54 +08:00
		return false;
2021-10-19 19:01:53 +08:00
	/*
	 * The read barrier orders the CPU's read of SPTE.W during the page table
	 * walk before the reads of sp->unsync/sp->unsync_children here.
	 *
	 * Even if another CPU was marking the SP as unsync-ed simultaneously,
	 * any guest page table changes are not guaranteed to be visible anyway
	 * until this VCPU issues a TLB flush strictly after those changes are
	 * made.  We only need to ensure that the other CPU sets these flags
	 * before any actual changes to the page tables are made.  The comments
	 * in mmu_try_to_unsync_pages() describe what could go wrong if this
	 * requirement isn't satisfied.
	 */
	smp_rmb();
2023-07-28 17:51:56 -07:00
	sp = root_to_sp(root);
2022-02-25 18:22:48 +00:00
	/*
	 * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
	 * PDPTEs for a given PAE root need to be synchronized individually.
	 */
	if (WARN_ON_ONCE(!sp))
		return false;
2021-10-19 19:01:53 +08:00
	if (sp->unsync || sp->unsync_children)
		return true;
	return false;
}
2018-06-27 14:59:05 -07:00
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
2008-09-23 13:18:34 -03:00
{
	int i;
	struct kvm_mmu_page *sp;
2022-02-10 08:00:56 -05:00
	if (vcpu->arch.mmu->root_role.direct)
2010-09-10 17:31:00 +02:00
		return;
2022-02-21 09:28:33 -05:00
	if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
2008-09-23 13:18:34 -03:00
		return;
2010-09-27 18:09:29 +08:00
2014-08-18 15:46:07 -07:00
	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
2018-06-27 14:59:05 -07:00
2022-02-10 07:42:22 -05:00
	if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
2022-02-21 09:28:33 -05:00
		hpa_t root = vcpu->arch.mmu->root.hpa;
2018-06-27 14:59:05 -07:00
2021-10-19 19:01:53 +08:00
		if (!is_unsync_root(root))
2018-06-27 14:59:05 -07:00
			return;
2023-07-28 17:51:56 -07:00
		sp = root_to_sp(root);
2021-02-02 10:57:24 -08:00
		write_lock(&vcpu->kvm->mmu_lock);
2021-09-18 08:56:28 +08:00
		mmu_sync_children(vcpu, sp, true);
2021-02-02 10:57:24 -08:00
		write_unlock(&vcpu->kvm->mmu_lock);
2008-09-23 13:18:34 -03:00
		return;
	}
2018-06-27 14:59:05 -07:00
2021-02-02 10:57:24 -08:00
	write_lock(&vcpu->kvm->mmu_lock);
2018-06-27 14:59:05 -07:00
2008-09-23 13:18:34 -03:00
	for (i = 0; i < 4; ++i) {
2018-10-08 21:28:05 +02:00
		hpa_t root = vcpu->arch.mmu->pae_root[i];
2008-09-23 13:18:34 -03:00
2021-03-09 14:42:06 -08:00
		if (IS_VALID_PAE_ROOT(root)) {
2022-10-19 16:56:16 +00:00
			sp = spte_to_child_sp(root);
2021-09-18 08:56:28 +08:00
			mmu_sync_children(vcpu, sp, true);
2008-09-23 13:18:34 -03:00
		}
	}
2021-02-02 10:57:24 -08:00
	write_unlock(&vcpu->kvm->mmu_lock);
2008-09-23 13:18:34 -03:00
}
2021-10-19 19:01:54 +08:00
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
{
	unsigned long roots_to_free = 0;
	int i;
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
		if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
	/* sync prev_roots by simply freeing them */
2022-02-21 09:31:51 -05:00
	kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free);
2021-10-19 19:01:54 +08:00
}
2021-11-24 20:20:44 +08:00
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
2022-03-11 15:03:41 +08:00
				  gpa_t vaddr, u64 access,
2021-11-24 20:20:44 +08:00
				  struct x86_exception *exception)
2006-12-10 02:21:36 -08:00
{
2010-11-22 17:53:26 +02:00
if ( exception )
exception - > error_code = 0 ;
2021-11-24 20:20:45 +08:00
return kvm_translate_gpa ( vcpu , mmu , vaddr , access , exception ) ;
2010-09-10 17:30:50 +02:00
}
2016-02-22 17:23:40 +09:00
static bool mmio_info_in_cache ( struct kvm_vcpu * vcpu , u64 addr , bool direct )
2011-07-12 03:33:44 +08:00
{
2017-08-17 18:36:58 +02:00
/*
* A nested guest cannot use the MMIO cache if it is using nested
* page tables , because cr2 is a nGPA while the cache stores GPAs .
*/
if ( mmu_is_nested ( vcpu ) )
return false ;
2011-07-12 03:33:44 +08:00
if ( direct )
return vcpu_match_mmio_gpa ( vcpu , addr ) ;
return vcpu_match_mmio_gva ( vcpu , addr ) ;
}
2020-10-14 11:26:58 -07:00
/*
* Return the level of the lowest level SPTE added to sptes .
* That SPTE may be non - present .
2021-07-13 22:09:54 +00:00
*
* Must be called between walk_shadow_page_lockless_ { begin , end } .
2020-10-14 11:26:58 -07:00
*/
2020-12-17 16:31:37 -08:00
static int get_walk ( struct kvm_vcpu * vcpu , u64 addr , u64 * sptes , int * root_level )
2011-07-12 03:33:44 +08:00
{
struct kvm_shadow_walk_iterator iterator ;
2020-12-17 16:31:36 -08:00
int leaf = - 1 ;
2020-10-14 11:26:58 -07:00
u64 spte ;
2011-07-12 03:33:44 +08:00
2020-12-17 16:31:37 -08:00
for ( shadow_walk_init ( & iterator , vcpu , addr ) ,
* root_level = iterator . level ;
2015-08-05 12:04:26 +08:00
shadow_walk_okay ( & iterator ) ;
__shadow_walk_next ( & iterator , spte ) ) {
2020-10-14 11:26:58 -07:00
leaf = iterator . level ;
2015-08-05 12:04:26 +08:00
spte = mmu_spte_get_lockless ( iterator . sptep ) ;
2020-12-17 16:31:38 -08:00
sptes [ leaf ] = spte ;
2020-10-14 11:26:58 -07:00
}
return leaf ;
}
2020-12-17 16:31:39 -08:00
/* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */
2020-10-14 11:26:58 -07:00
static bool get_mmio_spte ( struct kvm_vcpu * vcpu , u64 addr , u64 * sptep )
{
2020-12-17 16:31:38 -08:00
u64 sptes [ PT64_ROOT_MAX_LEVEL + 1 ] ;
2020-10-14 11:26:58 -07:00
struct rsvd_bits_validate * rsvd_check ;
2020-12-17 16:31:37 -08:00
int root , leaf , level ;
2020-10-14 11:26:58 -07:00
bool reserved = false ;
2021-07-13 22:09:54 +00:00
walk_shadow_page_lockless_begin ( vcpu ) ;
2022-10-12 18:16:59 +00:00
if ( is_tdp_mmu_active ( vcpu ) )
2020-12-17 16:31:37 -08:00
leaf = kvm_tdp_mmu_get_walk ( vcpu , addr , sptes , & root ) ;
2020-10-14 11:26:58 -07:00
else
2020-12-17 16:31:37 -08:00
leaf = get_walk ( vcpu , addr , sptes , & root ) ;
2020-10-14 11:26:58 -07:00
2021-07-13 22:09:54 +00:00
walk_shadow_page_lockless_end ( vcpu ) ;
2020-12-17 16:31:36 -08:00
if ( unlikely ( leaf < 0 ) ) {
* sptep = 0ull ;
return reserved ;
}
2020-12-17 16:31:39 -08:00
* sptep = sptes [ leaf ] ;
/*
* Skip reserved bits checks on the terminal leaf if it ' s not a valid
* SPTE . Note , this also ( intentionally ) skips MMIO SPTEs , which , by
* design , always have reserved bits set . The purpose of the checks is
* to detect reserved bits on non - MMIO SPTEs . i . e . buggy SPTEs .
*/
if ( ! is_shadow_present_pte ( sptes [ leaf ] ) )
leaf + + ;
2020-10-14 11:26:58 -07:00
rsvd_check = & vcpu - > arch . mmu - > shadow_zero_check ;
2020-12-17 16:31:39 -08:00
for ( level = root ; level > = leaf ; level - - )
2021-06-22 10:57:32 -07:00
reserved | = is_rsvd_spte ( rsvd_check , sptes [ level ] , level ) ;
2015-08-05 12:04:26 +08:00
if ( reserved ) {
2021-02-25 12:47:49 -08:00
pr_err ( " %s: reserved bits set on MMU-present spte, addr 0x%llx, hierarchy: \n " ,
2015-08-05 12:04:26 +08:00
__func__ , addr ) ;
2020-10-14 11:26:58 -07:00
for ( level = root ; level > = leaf ; level - - )
2021-02-25 12:47:49 -08:00
pr_err ( " ------ spte = 0x%llx level = %d, rsvd bits = 0x%llx " ,
sptes [ level ] , level ,
2021-06-22 10:57:32 -07:00
get_rsvd_bits ( rsvd_check , sptes [ level ] , level ) ) ;
2015-08-05 12:04:26 +08:00
}
KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler
Add a check on root_hpa at the beginning of the page fault handler to
consolidate several checks on root_hpa that are scattered throughout the
page fault code. This is a preparatory step towards eventually removing
such checks altogether, or at the very least WARNing if an invalid root
is encountered. Remove only the checks that can be easily audited to
confirm that root_hpa cannot be invalidated between their current
location and the new check in kvm_mmu_page_fault(), and aren't currently
protected by mmu_lock, i.e. keep the checks in __direct_map() and
FNAME(fetch) for the time being.
The root_hpa checks that are consolidated were all added by commit
37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere")
which was a follow up to a bug fix for __direct_map(), commit
989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map")
At the time, nested VMX had, in hindsight, crazy handling of nested
interrupts and would trigger a nested VM-Exit in ->interrupt_allowed(),
and thus unexpectedly reset the MMU in flows such as can_do_async_pf().
Now that the wonky nested VM-Exit behavior is gone, the root_hpa checks
are bogus and confusing, e.g. it's not at all obvious what they actually
protect against, and at first glance they appear to be broken since many
of them run without holding mmu_lock.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-06 15:57:27 -08:00
2015-08-05 12:04:26 +08:00
return reserved ;
2011-07-12 03:33:44 +08:00
}
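The reserved-bit check in get_mmio_spte() above can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: the sketch_* names, the per-level mask table, and the use of bit 0 as the "present" bit are not KVM's definitions. What the sketch does mirror is the shape of the logic: skip a non-present terminal entry, then OR together the reserved-bit check for every level from the root down to the leaf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LEVEL 5	/* hypothetical; stands in for PT64_ROOT_MAX_LEVEL */

/* Hypothetical per-level reserved-bit masks, indexed by paging level. */
struct sketch_rsvd_check {
	uint64_t rsvd_mask[SKETCH_MAX_LEVEL + 1];
};

/* Assumption for this sketch only: bit 0 marks an entry as present. */
static bool sketch_present(uint64_t spte)
{
	return spte & 1;
}

/*
 * sptes[] holds the entry recorded at each level of the walk, root is the
 * level where the walk started, leaf is the lowest level that was reached
 * (negative if nothing was walked).
 */
static bool sketch_walk_has_reserved(const struct sketch_rsvd_check *chk,
				     const uint64_t *sptes, int root, int leaf)
{
	bool reserved = false;
	int level;

	assert(root <= SKETCH_MAX_LEVEL);

	if (leaf < 0)
		return false;

	/* A non-present terminal entry is not checked at all. */
	if (!sketch_present(sptes[leaf]))
		leaf++;

	/* Accumulate the result over every remaining level of the walk. */
	for (level = root; level >= leaf; level--)
		reserved |= (sptes[level] & chk->rsvd_mask[level]) != 0;

	return reserved;
}
```

Skipping the non-present terminal entry is what keeps intentionally "poisoned" entries, such as the MMIO SPTEs described in the comment above, from being reported as bugs.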
2017-08-17 18:36:56 +02:00
static int handle_mmio_page_fault ( struct kvm_vcpu * vcpu , u64 addr , bool direct )
2011-07-12 03:33:44 +08:00
{
u64 spte ;
2015-08-05 12:04:26 +08:00
bool reserved ;
2011-07-12 03:33:44 +08:00
2016-02-22 17:23:40 +09:00
if ( mmio_info_in_cache ( vcpu , addr , direct ) )
2017-08-17 15:03:32 +02:00
return RET_PF_EMULATE ;
2011-07-12 03:33:44 +08:00
2020-10-14 11:26:58 -07:00
reserved = get_mmio_spte ( vcpu , addr , & spte ) ;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
if ( WARN_ON_ONCE ( reserved ) )
2017-08-17 15:03:32 +02:00
return - EINVAL ;
2011-07-12 03:33:44 +08:00
if ( is_mmio_spte ( spte ) ) {
gfn_t gfn = get_mmio_spte_gfn ( spte ) ;
2020-02-03 15:09:09 -08:00
unsigned int access = get_mmio_spte_access ( spte ) ;
2011-07-12 03:33:44 +08:00
2015-04-08 15:39:23 +02:00
if ( ! check_mmio_spte ( vcpu , spte ) )
2017-08-17 15:03:32 +02:00
return RET_PF_INVALID ;
2013-06-07 16:51:26 +08:00
2011-07-12 03:33:44 +08:00
if ( direct )
addr = 0 ;
2011-07-12 03:34:24 +08:00
trace_handle_mmio_page_fault ( addr , gfn , access ) ;
2011-07-12 03:33:44 +08:00
vcpu_cache_mmio_info ( vcpu , addr , gfn , access ) ;
2017-08-17 15:03:32 +02:00
return RET_PF_EMULATE ;
2011-07-12 03:33:44 +08:00
}
/*
* If the page table is zapped by other cpus , let CPU fault again on
* the address .
*/
2017-08-17 15:03:32 +02:00
return RET_PF_RETRY ;
2011-07-12 03:33:44 +08:00
}
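The WARN_ON_ONCE() conversion used in handle_mmio_page_fault() above can be sketched in userspace as follows. This is not the kernel macro: sketch_warn_on_once(), sketch_warn_count and sketch_handle_mmio_fault() are hypothetical names, and the per-call-site flag is passed explicitly rather than hidden in a macro. The point it illustrates is that the condition is still evaluated and acted on every time, while the log output is emitted only on the first hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually printed (for illustration). */
static int sketch_warn_count;

/*
 * Report 'what' the first time cond is true for a given call site, as
 * tracked by *warned. Always returns cond so callers can still branch.
 */
static bool sketch_warn_on_once(bool cond, bool *warned, const char *what)
{
	assert(what != NULL);

	if (cond && !*warned) {
		*warned = true;
		sketch_warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical caller: fails every time, but logs only once. */
static int sketch_handle_mmio_fault(bool reserved)
{
	static bool warned;

	if (sketch_warn_on_once(reserved, &warned, "reserved bits set on SPTE"))
		return -1;
	return 0;
}
```

Calling sketch_handle_mmio_fault(true) repeatedly returns an error each time but writes a single line to stderr, which is the log-spam property the commit message argues for.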
2016-02-24 17:51:11 +08:00
static bool page_fault_handle_page_track ( struct kvm_vcpu * vcpu ,
2021-08-06 04:21:58 -04:00
struct kvm_page_fault * fault )
2016-02-24 17:51:11 +08:00
{
2021-08-06 04:21:58 -04:00
if ( unlikely ( fault - > rsvd ) )
2016-02-24 17:51:11 +08:00
return false ;
2021-08-06 04:21:58 -04:00
if ( ! fault - > present | | ! fault - > write )
2016-02-24 17:51:11 +08:00
return false ;
	/*
	 * The guest is writing to a write-tracked page, which cannot be
	 * fixed by the page fault handler.
	 */
2023-07-28 18:35:30 -07:00
if ( kvm_gfn_is_write_tracked ( vcpu - > kvm , fault - > slot , fault - > gfn ) )
2016-02-24 17:51:11 +08:00
return true ;
return false ;
}
2016-02-24 17:51:12 +08:00
static void shadow_page_table_clear_flood ( struct kvm_vcpu * vcpu , gva_t addr )
{
struct kvm_shadow_walk_iterator iterator ;
u64 spte ;
walk_shadow_page_lockless_begin ( vcpu ) ;
2021-09-06 20:25:47 +08:00
for_each_shadow_entry_lockless ( vcpu , addr , iterator , spte )
2016-02-24 17:51:12 +08:00
clear_sp_write_flooding_count ( iterator . sptep ) ;
walk_shadow_page_lockless_end ( vcpu ) ;
}
2022-02-22 11:12:39 +08:00
static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
{
	/* make sure the token value is not 0 */
	u32 id = vcpu->arch.apf.id;

	if (id << 12 == 0)
		vcpu->arch.apf.id = 1;

	return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
}
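The token layout used by alloc_apf_token() above can be tried out in isolation. The sketch below is a userspace approximation: struct apf_state and sketch_alloc_apf_token() are hypothetical names, and the assumption that the vCPU id fits in the low 12 bits is made only so the two fields do not overlap in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU async page fault state. */
struct apf_state {
	uint32_t id;		/* rolling counter */
	uint32_t vcpu_id;	/* assumed here to fit in the low 12 bits */
};

/*
 * Build a token as (counter << 12) | vcpu_id. If the shifted counter
 * would be zero (initial state or 32-bit wraparound), restart the
 * counter at 1 so the upper field of the token is never zero.
 */
static uint32_t sketch_alloc_apf_token(struct apf_state *s)
{
	assert(s->vcpu_id < (1u << 12));

	if ((s->id << 12) == 0)
		s->id = 1;

	return (s->id++ << 12) | s->vcpu_id;
}
```

Starting from a zeroed counter on vCPU 3, the first two tokens are 0x1003 and 0x2003. Once the counter reaches 0x100000, the shift overflows to zero and the counter restarts at 1.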
2020-06-15 14:13:34 +02:00
static bool kvm_arch_setup_async_pf ( struct kvm_vcpu * vcpu , gpa_t cr2_or_gpa ,
gfn_t gfn )
2010-10-14 11:22:46 +02:00
{
struct kvm_arch_async_pf arch ;
2010-12-07 10:35:25 +08:00
2022-02-22 11:12:39 +08:00
arch . token = alloc_apf_token ( vcpu ) ;
2010-10-14 11:22:46 +02:00
arch . gfn = gfn ;
2022-02-10 08:00:56 -05:00
arch . direct_map = vcpu - > arch . mmu - > root_role . direct ;
2023-03-22 02:37:26 +01:00
arch . cr3 = kvm_mmu_get_guest_pgd ( vcpu , vcpu - > arch . mmu ) ;
2010-10-14 11:22:46 +02:00
2019-12-06 15:57:17 -08:00
return kvm_setup_async_pf ( vcpu , cr2_or_gpa ,
kvm_vcpu_gfn_to_hva ( vcpu , gfn ) , & arch ) ;
2010-10-14 11:22:46 +02:00
}
2022-04-23 03:47:47 +00:00
void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
	int r;

	if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
	      work->wakeup_all)
		return;

	r = kvm_mmu_reload(vcpu);
	if (unlikely(r))
		return;

	if (!vcpu->arch.mmu->root_role.direct &&
2023-03-22 02:37:26 +01:00
work - > arch . cr3 ! = kvm_mmu_get_guest_pgd ( vcpu , vcpu - > arch . mmu ) )
2022-04-23 03:47:47 +00:00
return ;
2023-02-02 18:28:15 +00:00
kvm_mmu_do_page_fault ( vcpu , work - > cr2_or_gpa , 0 , true , NULL ) ;
2022-04-23 03:47:47 +00:00
}
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 11:22:02 -07:00
static inline u8 kvm_max_level_for_order(int order)
{
	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);

	KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));

	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
		return PG_LEVEL_1G;

	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
		return PG_LEVEL_2M;

	return PG_LEVEL_4K;
}
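The order-to-level mapping in kvm_max_level_for_order() above can be sketched outside the kernel. The sketch assumes x86-64 paging with 4 KiB base pages and 9 address bits per level, so 2 MiB and 1 GiB pages correspond to orders 9 and 18. The sketch_* names and the enum are hypothetical stand-ins for PG_LEVEL_* and KVM_HPAGE_GFN_SHIFT().

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's PG_LEVEL_* constants. */
enum sketch_pg_level {
	SKETCH_LEVEL_4K = 1,
	SKETCH_LEVEL_2M = 2,
	SKETCH_LEVEL_1G = 3,
};

/* Assumption: each level up covers 2^9 times as many 4 KiB frames. */
static int sketch_gfn_shift(enum sketch_pg_level level)
{
	return (level - 1) * 9;
}

/* Largest mapping level whose size does not exceed 2^order base pages. */
static enum sketch_pg_level sketch_max_level_for_order(int order)
{
	assert(order >= 0);

	if (order >= sketch_gfn_shift(SKETCH_LEVEL_1G))
		return SKETCH_LEVEL_1G;

	if (order >= sketch_gfn_shift(SKETCH_LEVEL_2M))
		return SKETCH_LEVEL_2M;

	return SKETCH_LEVEL_4K;
}
```

An order that falls between two sizes, for example 10, rounds down to the smaller level, which matches the ">=" comparisons in the function above.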
static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
					      struct kvm_page_fault *fault)
{
	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
				      PAGE_SIZE, fault->write, fault->exec,
				      fault->is_private);
}
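The commit message above says that bit 3 of the memory-fault flags marks a private access, leaving bits 0-2 free for possible RWX information. A minimal sketch of how such an exit record might be filled in is shown below. The struct, the flag name and the helper are hypothetical and only loosely modelled on the description. They are not the KVM UAPI definitions, and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: bit 3 marks the faulting access as private. */
#define SKETCH_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Hypothetical record handed to userspace on a memory fault exit. */
struct sketch_memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* Describe a single-page fault at the given guest frame number. */
static struct sketch_memory_fault sketch_prepare_exit(uint64_t gfn,
						      bool is_private)
{
	struct sketch_memory_fault mf;

	/* Guard against losing high bits when converting gfn to gpa. */
	assert((gfn >> (64 - SKETCH_PAGE_SHIFT)) == 0);

	mf.flags = is_private ? SKETCH_MEMORY_EXIT_FLAG_PRIVATE : 0;
	mf.gpa = gfn << SKETCH_PAGE_SHIFT;
	mf.size = 1ULL << SKETCH_PAGE_SHIFT;
	return mf;
}
```

Userspace receiving such a record would compare the private bit against its own view of the page's attributes and convert the page if the two disagree.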
static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
				   struct kvm_page_fault *fault)
{
	int max_order, r;

	if (!kvm_slot_can_be_private(fault->slot)) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}

	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
			     &max_order);
	if (r) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return r;
	}

	fault->max_level = min(kvm_max_level_for_order(max_order),
			       fault->max_level);
	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);

	return RET_PF_CONTINUE;
}
2022-09-21 10:35:39 -07:00
static int __kvm_faultin_pfn ( struct kvm_vcpu * vcpu , struct kvm_page_fault * fault )
2010-10-14 11:22:46 +02:00
{
2021-09-24 05:05:26 -04:00
struct kvm_memory_slot * slot = fault - > slot ;
2010-10-14 11:22:46 +02:00
bool async ;
2021-02-25 12:47:30 -08:00
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved . This ensures any existing SPTEs for the old memslot will
* be zapped before KVM inserts a new MMIO SPTE for the gfn .
*/
if ( slot & & ( slot - > flags & KVM_MEMSLOT_INVALID ) )
2022-04-23 03:47:46 +00:00
return RET_PF_RETRY ;
2021-02-25 12:47:30 -08:00
2021-08-10 23:52:42 +03:00
if ( ! kvm_is_visible_memslot ( slot ) ) {
/* Don't expose private memslots to L2. */
if ( is_guest_mode ( vcpu ) ) {
2021-09-24 05:05:26 -04:00
fault - > slot = NULL ;
2021-08-07 08:57:34 -04:00
fault - > pfn = KVM_PFN_NOSLOT ;
fault - > map_writable = false ;
2022-04-23 03:47:46 +00:00
return RET_PF_CONTINUE ;
2021-08-10 23:52:42 +03:00
}
/*
* If the APIC access page exists but is disabled , go directly
* to emulation without caching the MMIO access or creating a
* MMIO SPTE . That way the cache doesn ' t need to be purged
* when the AVIC is re - enabled .
*/
if ( slot & & slot - > id = = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT & &
2022-04-23 03:47:46 +00:00
! kvm_apicv_activated ( vcpu - > kvm ) )
return RET_PF_EMULATE ;
2018-05-09 17:02:05 -04:00
}
2023-10-27 11:22:02 -07:00
if ( fault - > is_private ! = kvm_mem_is_private ( vcpu - > kvm , fault - > gfn ) ) {
kvm_mmu_prepare_memory_fault_exit ( vcpu , fault ) ;
return - EFAULT ;
}
if ( fault - > is_private )
return kvm_faultin_pfn_private ( vcpu , fault ) ;
2015-04-02 11:20:48 +02:00
async = false ;
2022-10-11 15:58:08 -04:00
fault - > pfn = __gfn_to_pfn_memslot ( slot , fault - > gfn , false , false , & async ,
2021-08-07 08:57:34 -04:00
fault - > write , & fault - > map_writable ,
& fault - > hva ) ;
2010-10-14 11:22:46 +02:00
if ( ! async )
2022-04-23 03:47:46 +00:00
return RET_PF_CONTINUE ; /* *pfn has correct page already */
2010-10-14 11:22:46 +02:00
2021-09-29 09:19:32 -04:00
if ( ! fault - > prefetch & & kvm_can_do_async_pf ( vcpu ) ) {
2021-08-07 08:57:34 -04:00
trace_kvm_try_async_get_page ( fault - > addr , fault - > gfn ) ;
if ( kvm_find_async_pf_gfn ( vcpu , fault - > gfn ) ) {
2022-08-07 05:21:41 +00:00
trace_kvm_async_pf_repeated_fault ( fault - > addr , fault - > gfn ) ;
2010-10-14 11:22:46 +02:00
kvm_make_request ( KVM_REQ_APF_HALT , vcpu ) ;
2022-04-23 03:47:46 +00:00
return RET_PF_RETRY ;
} else if ( kvm_arch_setup_async_pf ( vcpu , fault - > addr , fault - > gfn ) ) {
return RET_PF_RETRY ;
}
2010-10-14 11:22:46 +02:00
}
2022-10-11 15:59:47 -04:00
/*
* Allow gup to bail on pending non - fatal signals when it ' s also allowed
* to wait for IO . Note , gup always bails if it is unable to quickly
* get a page and a fatal signal , i . e . SIGKILL , is pending .
*/
fault - > pfn = __gfn_to_pfn_memslot ( slot , fault - > gfn , false , true , NULL ,
2021-08-07 08:57:34 -04:00
fault - > write , & fault - > map_writable ,
& fault - > hva ) ;
2022-04-23 03:47:46 +00:00
return RET_PF_CONTINUE ;
2010-10-14 11:22:46 +02:00
}
2022-09-21 10:35:42 -07:00
static int kvm_faultin_pfn ( struct kvm_vcpu * vcpu , struct kvm_page_fault * fault ,
unsigned int access )
2022-09-21 10:35:39 -07:00
{
2022-09-21 10:35:40 -07:00
int ret ;
2022-09-21 10:35:39 -07:00
fault - > mmu_seq = vcpu - > kvm - > mmu_invalidate_seq ;
smp_rmb ( ) ;
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation. Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault. And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.
Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).
Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.
Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).
Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.
Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.
Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock. This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.
Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-21 17:26:40 -08:00
	/*
	 * Check for a relevant mmu_notifier invalidation event before getting
	 * the pfn from the primary MMU, and before acquiring mmu_lock.
	 *
	 * For mmu_lock, if there is an in-progress invalidation and the kernel
	 * allows preemption, the invalidation task may drop mmu_lock and yield
	 * in response to mmu_lock being contended, which is *very*
	 * counter-productive as this vCPU can't actually make forward progress
	 * until the invalidation completes.
	 *
	 * Retrying now can also avoid unnecessary lock contention in the primary
	 * MMU, as the primary MMU doesn't necessarily hold a single lock for
	 * the duration of the invalidation, i.e. faulting in a conflicting pfn
	 * can cause the invalidation to take longer by holding locks that are
	 * needed to complete the invalidation.
	 *
	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
	 * will never yield mmu_lock in response to contention, as this vCPU is
	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
	 * to detect retry guarantees the worst case latency for the vCPU.
	 */
	if (fault->slot &&
	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;
2022-09-21 10:35:40 -07:00
ret = __kvm_faultin_pfn ( vcpu , fault ) ;
if ( ret ! = RET_PF_CONTINUE )
return ret ;
if ( unlikely ( is_error_pfn ( fault - > pfn ) ) )
2022-09-21 10:35:41 -07:00
return kvm_handle_error_pfn ( vcpu , fault ) ;
2022-09-21 10:35:40 -07:00
2022-09-21 10:35:42 -07:00
if ( unlikely ( ! fault - > slot ) )
return kvm_handle_noslot_fault ( vcpu , fault , access ) ;
2024-02-21 17:26:40 -08:00
	/*
	 * Check again for a relevant mmu_notifier invalidation event purely to
	 * avoid contending mmu_lock.  Most invalidations will be detected by
	 * the previous check, but checking is extremely cheap relative to the
	 * overall cost of failing to detect the invalidation until after
	 * mmu_lock is acquired.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}
2022-09-21 10:35:40 -07:00
return RET_PF_CONTINUE ;
2022-09-21 10:35:39 -07:00
}
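The lockless retry pre-check performed by kvm_faultin_pfn() above follows a snapshot-and-compare pattern that can be sketched with C11 atomics. This is a simplified userspace model, not the kernel helper: struct sketch_vm and all sketch_* functions are hypothetical, the acquire load only approximates the "read the sequence, then smp_rmb()" step in the source, and the range fields are read without synchronization on purpose, mirroring the "unsafe" nature of the check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the invalidation state tracked per VM. */
struct sketch_vm {
	atomic_long in_progress;	/* number of active invalidations */
	atomic_ulong seq;		/* bumped when an invalidation ends */
	uint64_t range_start;		/* first gfn being invalidated */
	uint64_t range_end;		/* one past the last gfn */
};

static void sketch_vm_init(struct sketch_vm *vm)
{
	atomic_init(&vm->in_progress, 0);
	atomic_init(&vm->seq, 0);
	vm->range_start = vm->range_end = 0;
}

static void sketch_invalidate_begin(struct sketch_vm *vm,
				    uint64_t start, uint64_t end)
{
	assert(start < end);
	vm->range_start = start;
	vm->range_end = end;
	atomic_fetch_add(&vm->in_progress, 1);
}

static void sketch_invalidate_end(struct sketch_vm *vm)
{
	atomic_fetch_add(&vm->seq, 1);
	atomic_fetch_sub(&vm->in_progress, 1);
}

/* Snapshot taken by the fault path before it looks up the page. */
static unsigned long sketch_snapshot_seq(struct sketch_vm *vm)
{
	return atomic_load_explicit(&vm->seq, memory_order_acquire);
}

/*
 * Retry if the gfn lies inside an active invalidation, or if any
 * invalidation completed since the snapshot was taken.
 */
static bool sketch_retry_gfn_unsafe(struct sketch_vm *vm,
				    unsigned long snap, uint64_t gfn)
{
	if (atomic_load_explicit(&vm->in_progress, memory_order_relaxed) > 0 &&
	    gfn >= vm->range_start && gfn < vm->range_end)
		return true;

	return atomic_load_explicit(&vm->seq, memory_order_relaxed) != snap;
}
```

While an invalidation is in flight, only faults inside the range are told to retry. Once it completes, the sequence mismatch makes every fault that took its snapshot earlier retry, including those outside the range.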
KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acquiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.
For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.
For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointless (the root will be unloaded), apply
the check to all MMUs.
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-20 04:50:22 +00:00
/*
* Returns true if the page fault is stale and needs to be retried , i . e . if the
* root was invalidated by a memslot update or a relevant mmu_notifier fired .
*/
static bool is_page_fault_stale ( struct kvm_vcpu * vcpu ,
2022-09-21 10:35:39 -07:00
struct kvm_page_fault * fault )
2021-11-20 04:50:22 +00:00
{
2023-07-28 17:51:56 -07:00
	struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
2021-12-09 06:05:46 +00:00
	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
	if (sp && is_obsolete_sp(vcpu->kvm, sp))
		return true;
	/*
	 * Roots without an associated shadow page are considered invalid if
	 * there is a pending request to free obsolete roots. The request is
	 * only a hint that the current root _may_ be obsolete and needs to be
	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
	 * to reload even if no vCPU is actively using the root.
	 */
2022-02-25 18:22:45 +00:00
	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
2021-11-20 04:50:22 +00:00
		return true;
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation. Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault. And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.
Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).
Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.
Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).
Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.
Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.
Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock. This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.
Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-21 17:26:40 -08:00
	/*
	 * Check for a relevant mmu_notifier invalidation event one last time
	 * now that mmu_lock is held, as the "unsafe" checks performed without
	 * holding mmu_lock can get false negatives.
	 */
2021-11-20 04:50:22 +00:00
	return fault->slot &&
2023-10-27 11:21:45 -07:00
	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
2021-11-20 04:50:22 +00:00
}
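The retry check described in the commit messages above can be modelled outside the kernel. What follows is a minimal userspace sketch, not KVM code: `struct inval_state`, `inval_begin()`, `inval_end()` and `fault_needs_retry()` are hypothetical names, C11 atomics stand in for READ_ONCE(), and only a single in-flight range is tracked. A fault snapshots the sequence count before resolving the page, then retries if the gfn lies inside an active invalidation or if the sequence count has moved since the snapshot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-VM invalidation bookkeeping. */
struct inval_state {
	atomic_ulong seq;		/* bumped when an invalidation completes */
	atomic_long in_progress;	/* number of invalidations in flight */
	uint64_t range_start;		/* first gfn covered (inclusive) */
	uint64_t range_end;		/* end of the covered range (exclusive) */
};

static void inval_init(struct inval_state *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->in_progress, 0);
	s->range_start = 0;
	s->range_end = 0;
}

/* Simplification: assumes at most one invalidation is active at a time. */
static void inval_begin(struct inval_state *s, uint64_t start, uint64_t end)
{
	s->range_start = start;
	s->range_end = end;
	atomic_fetch_add(&s->in_progress, 1);
}

static void inval_end(struct inval_state *s)
{
	/* Bump the sequence first so a late reader still sees a change. */
	atomic_fetch_add(&s->seq, 1);
	atomic_fetch_sub(&s->in_progress, 1);
}

/*
 * Returns true if a fault on @gfn that snapshotted @snap must be retried:
 * either @gfn is inside an active invalidation, or an invalidation has
 * completed since the snapshot was taken.
 */
static bool fault_needs_retry(struct inval_state *s, unsigned long snap,
			      uint64_t gfn)
{
	if (atomic_load(&s->in_progress) > 0 &&
	    gfn >= s->range_start && gfn < s->range_end)
		return true;

	return atomic_load(&s->seq) != snap;
}
```

In this model the same function serves both as the early pre-check and as the final check. Called without the lock, a `false` result is only a hint, because an invalidation may begin immediately afterwards, which is why the real code repeats the check once mmu_lock is held.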
2021-08-06 04:21:58 -04:00
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
2006-12-10 02:21:36 -08:00
{
2020-01-08 12:24:43 -08:00
	int r;
2011-07-12 03:33:44 +08:00
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).
Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly the same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.
Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).
If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:52:00 -07:00
	/* Dummy roots are used only for shadowing bad guest roots. */
	if (WARN_ON_ONCE(kvm_mmu_is_dummy_root(vcpu->arch.mmu->root.hpa)))
		return RET_PF_RETRY;
2021-08-06 04:21:58 -04:00
	if (page_fault_handle_page_track(vcpu, fault))
2017-08-17 15:03:32 +02:00
		return RET_PF_EMULATE;
2011-07-12 03:33:44 +08:00
2021-08-06 04:35:50 -04:00
	r = fast_page_fault(vcpu, fault);
2021-07-13 22:09:55 +00:00
	if (r != RET_PF_INVALID)
		return r;
2020-07-02 19:35:30 -07:00
2020-07-02 19:35:36 -07:00
	r = mmu_topup_memory_caches(vcpu, false);
2007-01-05 16:36:54 -08:00
	if (r)
		return r;
2007-01-05 16:36:53 -08:00
2022-09-21 10:35:42 -07:00
	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
2022-04-23 03:47:46 +00:00
	if (r != RET_PF_CONTINUE)
2019-12-06 15:57:16 -08:00
		return r;
2006-12-10 02:21:36 -08:00
2019-12-06 15:57:16 -08:00
	r = RET_PF_RETRY;
2022-09-21 10:35:44 -07:00
	write_lock(&vcpu->kvm->mmu_lock);
2021-02-02 10:57:29 -08:00
2022-09-21 10:35:39 -07:00
	if (is_page_fault_stale(vcpu, fault))
2019-12-06 15:57:16 -08:00
		goto out_unlock;
2021-02-02 10:57:29 -08:00
2020-06-23 12:35:42 -07:00
	r = make_mmu_pages_available(vcpu);
	if (r)
2019-12-06 15:57:16 -08:00
		goto out_unlock;
2021-11-20 04:50:22 +00:00
2022-09-21 10:35:46 -07:00
	r = direct_map(vcpu, fault);
2019-12-06 15:57:24 -08:00
2019-12-06 15:57:16 -08:00
out_unlock:
2022-09-21 10:35:44 -07:00
	write_unlock(&vcpu->kvm->mmu_lock);
2021-08-07 08:57:34 -04:00
	kvm_release_pfn_clean(fault->pfn);
2019-12-06 15:57:16 -08:00
	return r;
2006-12-10 02:21:36 -08:00
}
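direct_page_fault() follows a common kernel idiom: preset the return value to "retry", take the lock, and send every early exit through a single label that unlocks and drops the page reference. The following standalone sketch shows only that idiom. Everything in it is hypothetical: `struct fake_lock` is a counter standing in for mmu_lock, `struct fake_fault` stands in for the fault descriptor, and the `PF_*` values are illustrative rather than the kernel's RET_PF_* values.

```c
#include <stdbool.h>

enum { PF_RETRY = 0, PF_FIXED = 1 };	/* illustrative values only */

struct fake_lock { int depth; };	/* tracks lock/unlock balance */
struct fake_fault { bool stale; int pins; };

static void fake_lock_acquire(struct fake_lock *l) { l->depth++; }
static void fake_lock_release(struct fake_lock *l) { l->depth--; }

static int handle_fault(struct fake_lock *l, struct fake_fault *f)
{
	int r;

	f->pins++;		/* stands in for pinning the page before locking */

	r = PF_RETRY;		/* default result for every early exit */
	fake_lock_acquire(l);

	if (f->stale)
		goto out_unlock;

	r = PF_FIXED;		/* stands in for installing the mapping */

out_unlock:
	fake_lock_release(l);
	f->pins--;		/* the pin is dropped on every path */
	return r;
}
```

Because the result defaults to "retry", a stale fault needs no dedicated error code: the caller simply re-enters the guest and takes the fault again.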
2021-08-06 04:35:50 -04:00
static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
				struct kvm_page_fault *fault)
2019-12-06 15:57:24 -08:00
{
	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
2021-08-06 04:21:58 -04:00
	fault->max_level = PG_LEVEL_2M;
	return direct_page_fault(vcpu, fault);
2019-12-06 15:57:24 -08:00
}
2017-07-13 18:30:40 -07:00
int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
2017-08-11 18:36:43 +02:00
				u64 fault_address, char *insn, int insn_len)
2017-07-13 18:30:40 -07:00
{
	int r = 1;
2020-05-07 16:36:02 +02:00
	u32 flags = vcpu->arch.apf.host_apf_flags;
2017-07-13 18:30:40 -07:00
2019-12-06 15:57:14 -08:00
#ifndef CONFIG_X86_64
	/* A 64-bit CR2 should be impossible on 32-bit KVM. */
	if (WARN_ON_ONCE(fault_address >> 32))
		return -EFAULT;
#endif
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flag is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
  - Entry to vcpu_run(), i.e. after executing user space
  - From the sched_in notifier, i.e. when switching to a vCPU thread.
  - From vmexit handlers which are considered unsafe, i.e. where
    sensitive data can be brought into L1D:
    - The emulator, which could be a good target for other speculative
      execution-based threats,
    - The MMU, which can bring host page tables in the L1 cache.
    - External interrupts
    - Nested operations that require the MMU (see above). That is
      vmptrld, vmptrst, vmclear, vmwrite, vmread.
    - When handling invept, invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 13:07:14 +02:00
	vcpu->arch.l1tf_flush_l1d = true;
2020-05-07 16:36:02 +02:00
	if (!flags) {
2022-05-10 16:10:00 +09:00
		trace_kvm_page_fault(vcpu, fault_address, error_code);
2017-07-13 18:30:40 -07:00
2017-08-11 18:36:43 +02:00
		if (kvm_event_needs_reinjection(vcpu))
2017-07-13 18:30:40 -07:00
			kvm_mmu_unprotect_page_virt(vcpu, fault_address);
		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
				insn_len);
2020-05-07 16:36:02 +02:00
	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
2020-05-25 16:41:17 +02:00
		vcpu->arch.apf.host_apf_flags = 0;
2017-07-13 18:30:40 -07:00
		local_irq_disable();
2020-03-07 00:42:06 +01:00
		kvm_async_pf_task_wait_schedule(fault_address);
2017-07-13 18:30:40 -07:00
		local_irq_enable();
2020-05-07 16:36:02 +02:00
	} else {
		WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
2017-07-13 18:30:40 -07:00
	}
2020-05-07 16:36:02 +02:00
2017-07-13 18:30:40 -07:00
	return r;
}
EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
2022-09-21 10:35:44 -07:00
#ifdef CONFIG_X86_64
static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
				  struct kvm_page_fault *fault)
{
	int r;

	if (page_fault_handle_page_track(vcpu, fault))
		return RET_PF_EMULATE;

	r = fast_page_fault(vcpu, fault);
	if (r != RET_PF_INVALID)
		return r;

	r = mmu_topup_memory_caches(vcpu, false);
	if (r)
		return r;

	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
	if (r != RET_PF_CONTINUE)
		return r;

	r = RET_PF_RETRY;
	read_lock(&vcpu->kvm->mmu_lock);

	if (is_page_fault_stale(vcpu, fault))
		goto out_unlock;

	r = kvm_tdp_mmu_map(vcpu, fault);

out_unlock:
	read_unlock(&vcpu->kvm->mmu_lock);
	kvm_release_pfn_clean(fault->pfn);
	return r;
}
#endif
2023-07-14 14:50:06 +08:00
bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
2008-02-07 13:47:44 +01:00
{
2022-07-15 23:00:16 +00:00
	/*
2023-07-14 14:50:06 +08:00
	 * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
	 * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
	 * to honor the memtype from the guest's MTRRs so that guest accesses
	 * to memory that is DMA'd aren't cached against the guest's wishes.
2022-07-15 23:00:16 +00:00
	 *
	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
	 * e.g. KVM will force UC memtype for host MMIO.
	 */
2023-07-14 14:50:06 +08:00
	return vm_has_noncoherent_dma && shadow_memtype_mask;
}
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	/*
	 * If the guest's MTRRs may be used to compute the "real" memtype,
	 * restrict the mapping level to ensure KVM uses a consistent memtype
	 * across the entire mapping.
	 */
	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
2022-07-15 23:00:16 +00:00
		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
2022-10-10 20:19:12 +08:00
			gfn_t base = gfn_round_for_level(fault->gfn,
							 fault->max_level);
2021-08-06 04:21:58 -04:00
2022-07-15 23:00:16 +00:00
			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
				break;
		}
2015-10-16 17:06:02 +09:00
	}
2009-07-27 16:30:44 +02:00
2022-09-21 10:35:44 -07:00
#ifdef CONFIG_X86_64
	if (tdp_mmu_enabled)
		return kvm_tdp_mmu_page_fault(vcpu, fault);
#endif
2021-08-06 04:21:58 -04:00
	return direct_page_fault(vcpu, fault);
2008-02-07 13:47:44 +01:00
}
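The level-restriction loop in kvm_tdp_page_fault() can be tried in isolation. The sketch below is a userspace model under stated assumptions, not the kernel implementation: `pages_per_level()`, `round_for_level()` and `shrink_level()` are hypothetical helpers that assume x86's 9 address bits per paging level, and `no_boundary_at_0x300()` is a made-up predicate standing in for the MTRR consistency check. Starting from the largest allowed level, the loop steps down until the level-aligned block containing the faulting gfn has a single memory type.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

/* Number of 4K pages mapped at a level: 1, 512, 512 * 512. */
static uint64_t pages_per_level(int level)
{
	return UINT64_C(1) << ((level - 1) * 9);
}

/* Round a gfn down to the first gfn of the level-sized block containing it. */
static gfn_t round_for_level(gfn_t gfn, int level)
{
	return gfn & ~(pages_per_level(level) - 1);
}

/* Hypothetical predicate: does [base, base + npages) have one memory type? */
typedef bool (*range_ok_fn)(gfn_t base, uint64_t npages);

/* Lower max_level until the enclosing block is consistent, or 4K is reached. */
static int shrink_level(gfn_t gfn, int max_level, range_ok_fn range_ok)
{
	for ( ; max_level > LEVEL_4K; --max_level) {
		uint64_t npages = pages_per_level(max_level);
		gfn_t base = round_for_level(gfn, max_level);

		if (range_ok(base, npages))
			break;
	}
	return max_level;
}

/* Example predicate: the memory type changes at gfn 0x300. */
static bool no_boundary_at_0x300(gfn_t base, uint64_t npages)
{
	return !(base < 0x300 && base + npages > 0x300);
}
```

With this predicate, a fault at gfn 0x250 falls back to a 4K mapping because both the 1G and the 2M block straddle gfn 0x300, while a fault at gfn 0x450 can still use a 2M mapping.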
2021-06-22 10:57:21 -07:00
static void nonpaging_init_context(struct kvm_mmu *context)
{
	context->page_fault = nonpaging_page_fault;
	context->gva_to_gpa = nonpaging_gva_to_gpa;
2023-02-16 23:41:11 +08:00
	context->sync_spte = NULL;
}
2020-03-20 14:28:32 -07:00
static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
2020-02-28 14:52:40 -08:00
				  union kvm_mmu_page_role role)
{
2023-07-28 17:51:57 -07:00
	struct kvm_mmu_page *sp;
	if (!VALID_PAGE(root->hpa))
		return false;
	if (!role.direct && pgd != root->pgd)
		return false;
	sp = root_to_sp(root->hpa);
	if (WARN_ON_ONCE(!sp))
		return false;
	return role.word == sp->role.word;
2020-02-28 14:52:40 -08:00
}
2018-06-27 14:59:20 -07:00
/*
2022-02-09 02:49:47 -05:00
 * Find out if a previously cached root matching the new pgd/role is available,
 * and insert the current root as the MRU in the cache.
 * If a matching root is found, it is assigned to kvm_mmu->root and
 * true is returned.
 * If no match is found, kvm_mmu->root is left invalid, the LRU root is
 * evicted to make room for the current root, and false is returned.
2018-06-27 14:59:20 -07:00
 */
2022-02-09 02:49:47 -05:00
static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *mmu,
					      gpa_t new_pgd,
					      union kvm_mmu_page_role new_role)
2018-06-27 14:59:20 -07:00
{
	uint i;
2022-02-21 09:28:33 -05:00
	if (is_root_usable(&mmu->root, new_pgd, new_role))
2020-02-28 14:52:40 -08:00
		return true;
2018-06-27 14:59:20 -07:00
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
2022-02-09 02:49:47 -05:00
		/*
		 * The swaps end up rotating the cache like this:
		 *   C   0 1 2 3   (on entry to the function)
		 *   0   C 1 2 3
		 *   1   C 0 2 3
		 *   2   C 0 1 3
		 *   3   C 0 1 2   (on exit from the loop)
		 */
2022-02-21 09:28:33 -05:00
		swap(mmu->root, mmu->prev_roots[i]);
		if (is_root_usable(&mmu->root, new_pgd, new_role))
2022-02-09 02:49:47 -05:00
			return true;
2018-06-27 14:59:20 -07:00
	}
2022-02-09 02:49:47 -05:00
	kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
	return false;
2018-06-27 14:59:20 -07:00
}
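The rotation described in the comment inside cached_root_find_and_keep_current() can be reproduced with a small userspace model. This is a sketch, not the kernel code: the toy_* types and functions are hypothetical, a root is reduced to an integer id (0 meaning invalid), the match test is a plain id comparison instead of is_root_usable(), and TOY_NUM_PREV_ROOTS is an assumed cache size chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PREV_ROOTS 3	/* assumed cache size, for illustration only */

/* Hypothetical stand-in for a cached root: an id, with 0 meaning "invalid". */
struct toy_root {
	int id;
};

struct toy_mmu {
	struct toy_root root;				/* current root ("C") */
	struct toy_root prev_roots[TOY_NUM_PREV_ROOTS];	/* MRU first, LRU last */
};

void toy_swap(struct toy_root *a, struct toy_root *b)
{
	struct toy_root tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Look for `wanted` while keeping the old current root in the cache.
 * Each swap pushes the previous candidate one slot down, so on a hit the
 * old current root ends up as the MRU entry; on a miss the LRU entry is
 * left in mmu->root and then dropped.
 */
bool toy_find_and_keep_current(struct toy_mmu *mmu, int wanted)
{
	size_t i;

	assert(wanted != 0 && mmu->root.id != 0);

	if (mmu->root.id == wanted)
		return true;

	for (i = 0; i < TOY_NUM_PREV_ROOTS; i++) {
		toy_swap(&mmu->root, &mmu->prev_roots[i]);
		if (mmu->root.id == wanted)
			return true;
	}

	/* Stands in for freeing the current root: the LRU entry is evicted. */
	mmu->root.id = 0;
	return false;
}
```

Starting from root 10 with cached roots 1, 2, 3, a lookup of 3 leaves root 3 current and the cache as 10, 1, 2. A lookup of an unknown id leaves the current root invalid and the same cache contents, with the former LRU entry (3) gone.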
2022-02-09 02:49:47 -05:00
/*
 * Find out if a previously cached root matching the new pgd/role is available.
 * On entry, mmu->root is invalid.
 * If a matching root is found, it is assigned to kvm_mmu->root, the LRU entry
 * of the cache becomes invalid, and true is returned.
 * If no match is found, kvm_mmu->root is left invalid and false is returned.
 */
static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mmu,
					     gpa_t new_pgd,
					     union kvm_mmu_page_role new_role)
{
2022-02-09 02:49:47 -05:00
	uint i;
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
		if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
			goto hit;
2018-06-27 14:59:06 -07:00
2022-02-09 02:49:47 -05:00
	return false;
hit:
	swap(mmu->root, mmu->prev_roots[i]);
	/* Bubble up the remaining roots. */
	for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
	mmu->prev_roots[i].hpa = INVALID_PAGE;
	return true;
}
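The "hit, then bubble up" step of cached_root_find_without_current() can be modelled in userspace as follows. This is a sketch under assumptions: the sketch_* names are hypothetical, a root is just an integer id (0 meaning invalid), matching is an id comparison rather than is_root_usable(), and SKETCH_NUM_PREV_ROOTS is an assumed cache size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_PREV_ROOTS 3	/* assumed cache size, for illustration only */

/* Hypothetical stand-in for a cached root: an id, with 0 meaning "invalid". */
struct sketch_root {
	int id;
};

struct sketch_mmu {
	struct sketch_root root;
	struct sketch_root prev_roots[SKETCH_NUM_PREV_ROOTS];
};

/*
 * On entry the current root must be invalid. On a hit, the matching entry
 * becomes the current root, later entries move up one slot to close the
 * gap, and the last (LRU) slot is marked invalid.
 */
bool sketch_find_without_current(struct sketch_mmu *mmu, int wanted)
{
	struct sketch_root tmp;
	size_t i;

	assert(wanted != 0 && mmu->root.id == 0);

	for (i = 0; i < SKETCH_NUM_PREV_ROOTS; i++)
		if (mmu->prev_roots[i].id == wanted)
			break;

	if (i == SKETCH_NUM_PREV_ROOTS)
		return false;

	/* Swap the invalid current root with the matching cache entry. */
	tmp = mmu->root;
	mmu->root = mmu->prev_roots[i];
	mmu->prev_roots[i] = tmp;

	/* Bubble up the remaining roots. */
	for (; i < SKETCH_NUM_PREV_ROOTS - 1; i++)
		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
	mmu->prev_roots[i].id = 0;

	return true;
}
```

With cached roots 1, 2, 3 and no current root, a lookup of 2 makes 2 the current root and leaves the cache as 1, 3, invalid. A miss changes nothing.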
static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
			    gpa_t new_pgd, union kvm_mmu_page_role new_role)
{
2018-06-27 14:59:06 -07:00
/*
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).
Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.
Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).
If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:52:00 -07:00
	 * Limit reuse to 64-bit hosts+VMs without "special" roots in order to
	 * avoid having to deal with PDPTEs and other complexities.
2018-06-27 14:59:06 -07:00
	 */
2023-07-28 17:51:56 -07:00
	if (VALID_PAGE(mmu->root.hpa) && !root_to_sp(mmu->root.hpa))
2022-02-09 02:49:47 -05:00
		kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
2018-06-27 14:59:06 -07:00
2022-02-09 02:49:47 -05:00
	if (VALID_PAGE(mmu->root.hpa))
		return cached_root_find_and_keep_current(kvm, mmu, new_pgd, new_role);
	else
		return cached_root_find_without_current(kvm, mmu, new_pgd, new_role);
}
2021-11-22 13:18:23 -05:00
void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
{
2022-02-21 09:31:51 -05:00
	struct kvm_mmu *mmu = vcpu->arch.mmu;
2022-02-14 08:46:24 -05:00
	union kvm_mmu_page_role new_role = mmu->root_role;
2022-02-21 09:31:51 -05:00
2022-11-28 21:47:09 +00:00
	/*
	 * Return immediately if no usable root was found, kvm_mmu_reload()
	 * will establish a valid root prior to the next VM-Enter.
	 */
	if (!fast_pgd_switch(vcpu->kvm, mmu, new_pgd, new_role))
2020-03-20 14:28:26 -07:00
		return;
	/*
	 * It's possible that the cached previous root page is obsolete because
	 * of a change in the MMU generation number. However, changing the
2022-02-25 18:22:45 +00:00
	 * generation number is accompanied by KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
	 * which will free the root set here and allocate a new one.
2020-03-20 14:28:26 -07:00
	 */
	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
2021-06-09 16:42:27 -07:00
	if (force_flush_and_sync_on_reuse) {
2020-03-20 14:28:26 -07:00
		kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
2021-06-09 16:42:27 -07:00
	}
2020-03-20 14:28:26 -07:00
	/*
	 * The last MMIO access's GVA and GPA are cached in the VCPU. When
	 * switching to a new CR3, that GVA->GPA mapping may no longer be
	 * valid. So clear any cached MMIO info even when we don't need to sync
	 * the shadow page tables.
	 */
	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
2020-10-14 11:26:59 -07:00
	/*
	 * If this is a direct root page, it doesn't have a write flooding
	 * count. Otherwise, clear the write flooding count.
	 */
2023-07-28 17:51:57 -07:00
	if (!new_role.direct) {
		struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
		if (!WARN_ON_ONCE(!sp))
			__clear_sp_write_flooding_count(sp);
	}
}
2020-03-20 14:28:32 -07:00
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
2018-06-27 14:59:09 -07:00
2015-04-08 15:39:23 +02:00
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
2021-09-18 08:56:32 +08:00
			   unsigned int access)
2011-07-12 03:33:44 +08:00
{
	if (unlikely(is_mmio_spte(*sptep))) {
		if (gfn != get_mmio_spte_gfn(*sptep)) {
			mmu_spte_clear_no_track(sptep);
			return true;
		}
2015-04-08 15:39:23 +02:00
		mark_mmio_spte(vcpu, sptep, gfn, access);
2011-07-12 03:33:44 +08:00
		return true;
	}
	return false;
}
2013-08-05 11:07:12 +03:00
#define PTTYPE_EPT 18 /* arbitrary */
#define PTTYPE PTTYPE_EPT
#include "paging_tmpl.h"
#undef PTTYPE
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE
#define PTTYPE 32
#include "paging_tmpl.h"
#undef PTTYPE
2023-02-02 18:27:51 +00:00
static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
				    u64 pa_bits_rsvd, int level, bool nx,
				    bool gbpages, bool pse, bool amd)
2009-03-30 16:21:08 +08:00
{
2014-05-07 15:32:50 +03:00
	u64 gbpages_bit_rsvd = 0;
2014-09-02 13:24:12 +02:00
	u64 nonleaf_bit8_rsvd = 0;
KVM: x86: Use reserved_gpa_bits to calculate reserved PxE bits
Use reserved_gpa_bits, which accounts for exceptions to the maxphyaddr
rule, e.g. SEV's C-bit, for the page {table,directory,etc...} entry (PxE)
reserved bits checks. For SEV, the C-bit is ignored by hardware when
walking page tables, e.g. the APM states:
Note that while the guest may choose to set the C-bit explicitly on
instruction pages and page table addresses, the value of this bit is a
don't-care in such situations as hardware always performs these as
private accesses.
Such behavior is expected to hold true for other features that repurpose
GPA bits, e.g. KVM could theoretically emulate SME or MKTME, which both
allow non-zero repurposed bits in the page tables. Conceptually, KVM
should apply reserved GPA checks universally, and any features that do
not adhere to the basic rule should be explicitly handled, i.e. if a GPA
bit is repurposed but not allowed in page tables for whatever reason.
Refactor __reset_rsvds_bits_mask() to take the pre-generated reserved
bits mask, and opportunistically clean up its code, e.g. to align lines
and comments.
Practically speaking, this change is likely a glorified nop given
the current KVM code base. SEV's C-bit is the only repurposed GPA bit,
and KVM doesn't support shadowing encrypted page tables (which is
theoretically possible via SEV debug APIs).
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-03 16:01:13 -08:00
	u64 high_bits_rsvd;
2009-03-30 16:21:08 +08:00
2015-08-05 12:04:21 +08:00
	rsvd_check->bad_mt_xwr = 0;
2013-08-06 12:00:32 +03:00
2015-08-05 12:04:22 +08:00
	if (!gbpages)
2014-05-07 15:32:50 +03:00
		gbpages_bit_rsvd = rsvd_bits(7, 7);
2014-09-02 13:24:12 +02:00
2021-02-03 16:01:13 -08:00
	if (level == PT32E_ROOT_LEVEL)
		high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 62);
	else
		high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51);
	/* Note, NX doesn't exist in PDPTEs, this is handled below. */
	if (!nx)
		high_bits_rsvd |= rsvd_bits(63, 63);
2014-09-02 13:24:12 +02:00
	/*
	 * Non-leaf PML4Es and PDPEs reserve bit 8 (which would be the G bit for
	 * leaf entries) on AMD CPUs only.
	 */
2015-09-22 23:02:14 +02:00
	if (amd)
2014-09-02 13:24:12 +02:00
		nonleaf_bit8_rsvd = rsvd_bits(8, 8);
2015-08-05 12:04:22 +08:00
	switch (level) {
2009-03-30 16:21:08 +08:00
	case PT32_ROOT_LEVEL:
		/* no rsvd bits for 2 level 4K page table entries */
2015-08-05 12:04:21 +08:00
		rsvd_check->rsvd_bits_mask[0][1] = 0;
		rsvd_check->rsvd_bits_mask[0][0] = 0;
		rsvd_check->rsvd_bits_mask[1][0] =
			rsvd_check->rsvd_bits_mask[0][0];
2010-03-19 17:58:53 +08:00
2015-08-05 12:04:22 +08:00
		if (!pse) {
2015-08-05 12:04:21 +08:00
			rsvd_check->rsvd_bits_mask[1][1] = 0;
2010-03-19 17:58:53 +08:00
			break;
		}
2009-03-30 16:21:08 +08:00
		if (is_cpuid_PSE36())
			/* 36bits PSE 4MB page */
2015-08-05 12:04:21 +08:00
			rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
2009-03-30 16:21:08 +08:00
		else
			/* 32 bits PSE 4MB page */
2015-08-05 12:04:21 +08:00
			rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
2009-03-30 16:21:08 +08:00
		break;
	case PT32E_ROOT_LEVEL:
2021-02-03 16:01:13 -08:00
		rsvd_check->rsvd_bits_mask[0][2] = rsvd_bits(63, 63) |
						   high_bits_rsvd |
						   rsvd_bits(5, 8) |
						   rsvd_bits(1, 2);	/* PDPTE */
		rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd;	/* PDE */
		rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd;	/* PTE */
		rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd |
						   rsvd_bits(13, 20);	/* large page */
2015-08-05 12:04:21 +08:00
		rsvd_check->rsvd_bits_mask[1][0] =
			rsvd_check->rsvd_bits_mask[0][0];
2009-03-30 16:21:08 +08:00
		break;
2017-08-24 20:27:55 +08:00
	case PT64_ROOT_5LEVEL:
2021-02-03 16:01:13 -08:00
		rsvd_check->rsvd_bits_mask[0][4] = high_bits_rsvd |
						   nonleaf_bit8_rsvd |
						   rsvd_bits(7, 7);
2017-08-24 20:27:55 +08:00
		rsvd_check->rsvd_bits_mask[1][4] =
			rsvd_check->rsvd_bits_mask[0][4];
2020-08-23 17:36:59 -05:00
		fallthrough;
2017-08-24 20:27:54 +08:00
	case PT64_ROOT_4LEVEL:
2021-02-03 16:01:13 -08:00
		rsvd_check->rsvd_bits_mask[0][3] = high_bits_rsvd |
						   nonleaf_bit8_rsvd |
						   rsvd_bits(7, 7);
		rsvd_check->rsvd_bits_mask[0][2] = high_bits_rsvd |
						   gbpages_bit_rsvd;
		rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd;
		rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd;
2015-08-05 12:04:21 +08:00
		rsvd_check->rsvd_bits_mask[1][3] =
			rsvd_check->rsvd_bits_mask[0][3];
2021-02-03 16:01:13 -08:00
		rsvd_check->rsvd_bits_mask[1][2] = high_bits_rsvd |
						   gbpages_bit_rsvd |
						   rsvd_bits(13, 29);
		rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd |
						   rsvd_bits(13, 20);	/* large page */
2015-08-05 12:04:21 +08:00
		rsvd_check->rsvd_bits_mask[1][0] =
			rsvd_check->rsvd_bits_mask[0][0];
2009-03-30 16:21:08 +08:00
		break;
	}
}
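The reserved-bit masks built by __reset_rsvds_bits_mask() can be illustrated with a small standalone sketch. Everything below is hypothetical and written for illustration only: rsvd_bits_sketch() is an assumed stand-in for rsvd_bits() (a mask with bits s..e set, inclusive), high_bits_rsvd_sketch() assumes that pa_bits_rsvd covers every bit from the physical address width up to bit 63, and has_rsvd_bits_set() is an assumed way of testing an entry against a mask. None of these names exist in the source.

```c
#include <stdint.h>

/*
 * Assumed stand-in for rsvd_bits(): a mask with bits s..e (inclusive) set.
 * The unsigned shift wraps to 0 for a 64-bit span, which yields all ones.
 */
uint64_t rsvd_bits_sketch(int s, int e)
{
	if (e < s)
		return 0;
	return ((2ULL << (e - s)) - 1) << s;
}

/*
 * Hypothetical helper: reserved high bits of a 64-bit entry for a given
 * physical address width, following the non-PAE-root branch of the code
 * above (address bits end at 51, bit 63 is reserved when NX is off).
 */
uint64_t high_bits_rsvd_sketch(int maxphyaddr, int nx)
{
	/* Assumption: pa_bits_rsvd is "everything at or above maxphyaddr". */
	uint64_t pa_bits_rsvd = rsvd_bits_sketch(maxphyaddr, 63);
	uint64_t rsvd = pa_bits_rsvd & rsvd_bits_sketch(0, 51);

	if (!nx)
		rsvd |= rsvd_bits_sketch(63, 63);
	return rsvd;
}

/* Non-zero if the entry sets any bit that the mask declares reserved. */
int has_rsvd_bits_set(uint64_t entry, uint64_t mask)
{
	return (entry & mask) != 0;
}
```

With a 40-bit physical address width and NX enabled, the sketch reserves bits 40..51, so an entry with bit 45 set would be flagged while an ordinary 4K mapping at a low address would not.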
2022-04-19 23:17:02 +12:00
static void reset_guest_rsvds_bits_mask(struct kvm_vcpu *vcpu,
					struct kvm_mmu *context)
2015-08-05 12:04:22 +08:00
{
2021-06-22 10:57:16 -07:00
	__reset_rsvds_bits_mask(&context->guest_rsvd_check,
2021-02-03 16:01:13 -08:00
				vcpu->arch.reserved_gpa_bits,
2022-02-10 07:42:22 -05:00
				context->cpu_role.base.level, is_efer_nx(context),
2023-08-15 13:36:40 -07:00
				guest_can_use(vcpu, X86_FEATURE_GBPAGES),
2021-06-22 10:57:15 -07:00
				is_cr4_pse(context),
KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.
Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor. One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.
KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior. And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel. And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.
Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior. The Intel code will be
cleaned up in the future.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-05 16:55:54 -07:00
				guest_cpuid_is_amd_compatible(vcpu));
2015-08-05 12:04:22 +08:00
}
2023-02-02 18:27:51 +00:00
static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
					u64 pa_bits_rsvd, bool execonly,
					int huge_page_level)
2013-08-06 12:00:32 +03:00
{
2021-02-03 16:01:13 -08:00
	u64 high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51);
2021-11-24 20:20:48 +08:00
	u64 large_1g_rsvd = 0, large_2m_rsvd = 0;
2015-09-23 10:34:26 +02:00
	u64 bad_mt_xwr;
2013-08-06 12:00:32 +03:00
2021-11-24 20:20:48 +08:00
	if (huge_page_level < PG_LEVEL_1G)
		large_1g_rsvd = rsvd_bits(7, 7);
	if (huge_page_level < PG_LEVEL_2M)
		large_2m_rsvd = rsvd_bits(7, 7);
2021-02-03 16:01:13 -08:00
	rsvd_check->rsvd_bits_mask[0][4] = high_bits_rsvd | rsvd_bits(3, 7);
	rsvd_check->rsvd_bits_mask[0][3] = high_bits_rsvd | rsvd_bits(3, 7);
2021-11-24 20:20:48 +08:00
	rsvd_check->rsvd_bits_mask[0][2] = high_bits_rsvd | rsvd_bits(3, 6) | large_1g_rsvd;
	rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd | rsvd_bits(3, 6) | large_2m_rsvd;
2021-02-03 16:01:13 -08:00
	rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd;
2013-08-06 12:00:32 +03:00
	/* large page */
2017-08-24 20:27:55 +08:00
	rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];
2015-08-05 12:04:21 +08:00
	rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
2021-11-24 20:20:48 +08:00
	rsvd_check->rsvd_bits_mask[1][2] = high_bits_rsvd | rsvd_bits(12, 29) | large_1g_rsvd;
	rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd | rsvd_bits(12, 20) | large_2m_rsvd;
2015-08-05 12:04:21 +08:00
	rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
2013-08-06 12:00:32 +03:00
2015-09-23 10:34:26 +02:00
	bad_mt_xwr = 0xFFull << (2 * 8);	/* bits 3..5 must not be 2 */
	bad_mt_xwr |= 0xFFull << (3 * 8);	/* bits 3..5 must not be 3 */
	bad_mt_xwr |= 0xFFull << (7 * 8);	/* bits 3..5 must not be 7 */
	bad_mt_xwr |= REPEAT_BYTE(1ull << 2);	/* bits 0..2 must not be 010 */
	bad_mt_xwr |= REPEAT_BYTE(1ull << 6);	/* bits 0..2 must not be 110 */
	if (!execonly) {
		/* bits 0..2 must not be 100 unless VMX capabilities allow it */
		bad_mt_xwr |= REPEAT_BYTE(1ull << 4);
2013-08-06 12:00:32 +03:00
	}
2015-09-23 10:34:26 +02:00
	rsvd_check->bad_mt_xwr = bad_mt_xwr;
2013-08-06 12:00:32 +03:00
}
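The bad_mt_xwr value built above is a 64-entry bitmap: each byte corresponds to one memory type (bits 3..5 of an EPT entry) and each bit within a byte to one XWR combination (bits 0..2). A minimal sketch of how such a bitmap could be built and consulted follows. The helper names are hypothetical, repeat_byte_sketch() is an assumed stand-in for REPEAT_BYTE(), and the lookup by the low six bits of the entry is an assumption about how the bitmap is used, not something shown in this section.

```c
#include <stdint.h>

/* Assumed stand-in for REPEAT_BYTE(): copy a byte value into all 8 bytes. */
uint64_t repeat_byte_sketch(uint64_t b)
{
	return (~0ULL / 0xff) * b;
}

/* Rebuild the bitmap with the same steps as the code above. */
uint64_t build_bad_mt_xwr(int execonly)
{
	uint64_t bad = 0xFFULL << (2 * 8);	/* memory type 2 is invalid */

	bad |= 0xFFULL << (3 * 8);		/* memory type 3 is invalid */
	bad |= 0xFFULL << (7 * 8);		/* memory type 7 is invalid */
	bad |= repeat_byte_sketch(1ULL << 2);	/* XWR == 010: write-only */
	bad |= repeat_byte_sketch(1ULL << 6);	/* XWR == 110: write+exec, no read */
	if (!execonly)
		bad |= repeat_byte_sketch(1ULL << 4);	/* XWR == 100: exec-only */
	return bad;
}

/*
 * Assumed lookup: the low six bits of the entry (memory type and XWR)
 * select one bit of the bitmap; a set bit marks an illegal combination.
 */
int ept_low_bits_are_bad(uint64_t bad_mt_xwr, uint64_t entry)
{
	return (int)((bad_mt_xwr >> (entry & 0x3f)) & 1);
}
```

Under these assumptions, a single shift and mask replaces what would otherwise be a chain of comparisons on the memory type and permission bits.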
2015-08-05 12:04:23 +08:00
static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
2021-11-24 20:20:48 +08:00
		struct kvm_mmu *context, bool execonly, int huge_page_level)
2015-08-05 12:04:23 +08:00
{
	__reset_rsvds_bits_mask_ept(&context->guest_rsvd_check,
2021-11-24 20:20:48 +08:00
				    vcpu->arch.reserved_gpa_bits, execonly,
				    huge_page_level);
2015-08-05 12:04:23 +08:00
}
2021-02-03 16:01:14 -08:00
static inline u64 reserved_hpa_bits(void)
{
	return rsvd_bits(shadow_phys_bits, 63);
}
2015-08-05 12:04:24 +08:00
/*
 * The page table on the host is the shadow page table for the page
 * table in the guest or in an AMD nested guest; its MMU features
 * completely follow the features in the guest.
 */
2021-06-22 10:57:03 -07:00
static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
					struct kvm_mmu *context)
2015-08-05 12:04:24 +08:00
{
2021-06-22 10:57:14 -07:00
	/* @amd adds a check on bit 8 of SPTEs, which KVM shouldn't use anyways. */
	bool is_amd = true;
	/* KVM doesn't use 2-level page tables for the shadow MMU. */
	bool is_pse = false;
2017-08-25 15:55:40 -05:00
	struct rsvd_bits_validate *shadow_zero_check;
	int i;
2016-03-09 14:28:02 +01:00
2022-02-10 07:41:19 -05:00
	WARN_ON_ONCE(context->root_role.level < PT32E_ROOT_LEVEL);
2021-06-22 10:57:14 -07:00
2017-08-25 15:55:40 -05:00
	shadow_zero_check = &context->shadow_zero_check;
2021-06-22 10:57:16 -07:00
	__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
2022-02-10 07:41:19 -05:00
				context->root_role.level,
2022-02-14 08:46:24 -05:00
				context->root_role.efer_nx,
2023-08-15 13:36:40 -07:00
				guest_can_use(vcpu, X86_FEATURE_GBPAGES),
				is_pse, is_amd);
2017-08-25 15:55:40 -05:00
	if (!shadow_me_mask)
		return;
2022-02-10 07:41:19 -05:00
	for (i = context->root_role.level; --i >= 0;) {
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
the high physical address bits as 'KeyID' bits. Intel Trust Domain
Extensions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory. MKTME KeyID bits should be set
to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set to the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a super
set of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's life time. This
perhaps doesn't fit MKTME but for now host kernel doesn't support it
(and perhaps will never do).
- Bits in shadow_me_mask are set to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-19 23:17:03 +12:00
		/*
		 * So far shadow_me_value is a constant during KVM's life
		 * time.  Bits in shadow_me_value are allowed to be set.
		 * Bits in shadow_me_mask but not in shadow_me_value are
		 * not allowed to be set.
		 */
		shadow_zero_check->rsvd_bits_mask[0][i] |= shadow_me_mask;
		shadow_zero_check->rsvd_bits_mask[1][i] |= shadow_me_mask;
		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_value;
		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_value;
2017-08-25 15:55:40 -05:00
	}
2015-08-05 12:04:24 +08:00
}
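The two-step update in the loop above (set every possible memory-encryption bit, then clear the ones that are actually written into SPTEs) can be shown in isolation. This is a minimal sketch with a hypothetical function name and made-up bit positions. It only mirrors the OR-then-AND-NOT sequence and makes no claim about which bits real hardware uses.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the loop body: bits in me_mask become
 * must-be-zero, except for the bits in me_value, which stay allowed.
 */
uint64_t apply_me_bits(uint64_t rsvd_mask, uint64_t me_mask, uint64_t me_value)
{
	rsvd_mask |= me_mask;
	rsvd_mask &= ~me_value;
	return rsvd_mask;
}
```

When me_value is zero, every bit in me_mask ends up in the must-be-zero mask. When me_value equals me_mask, the encryption bits are all left out of it.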
2015-09-22 23:02:14 +02:00
static inline bool boot_cpu_is_amd(void)
{
	WARN_ON_ONCE(!tdp_enabled);
	return shadow_x_mask == 0;
}
2015-08-05 12:04:24 +08:00
/*
 * The direct page table on the host uses as many MMU features as
 * possible; however, KVM currently does not do execution-protection.
 */
2023-02-02 18:27:51 +00:00
static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
2015-08-05 12:04:24 +08:00
{
2017-08-25 15:55:40 -05:00
	struct rsvd_bits_validate *shadow_zero_check;
	int i;
	shadow_zero_check = &context->shadow_zero_check;
2015-09-22 23:02:14 +02:00
	if (boot_cpu_is_amd())
2021-06-22 10:57:16 -07:00
		__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
2022-07-23 01:30:29 +00:00
					context->root_role.level, true,
2016-03-29 17:41:58 +02:00
					boot_cpu_has(X86_FEATURE_GBPAGES),
2021-06-22 10:57:14 -07:00
					false, true);
2015-08-05 12:04:24 +08:00
	else
2017-08-25 15:55:40 -05:00
		__reset_rsvds_bits_mask_ept(shadow_zero_check,
2021-11-24 20:20:48 +08:00
					    reserved_hpa_bits(), false,
					    max_huge_page_level);
2015-08-05 12:04:24 +08:00
2017-08-25 15:55:40 -05:00
	if (!shadow_me_mask)
		return;
2022-02-10 07:41:19 -05:00
	for (i = context->root_role.level; --i >= 0;) {
2017-08-25 15:55:40 -05:00
		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
	}
2015-08-05 12:04:24 +08:00
}
/*
 * Same as the comment on reset_shadow_zero_bits_mask(), except that this
 * is the shadow page table for an Intel nested guest.
 */
static void
2022-01-25 17:58:53 +08:00
reset_ept_shadow_zero_bits_mask(struct kvm_mmu *context, bool execonly)
2015-08-05 12:04:24 +08:00
{
	__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
2021-11-24 20:20:48 +08:00
				    reserved_hpa_bits(), execonly,
				    max_huge_page_level);
2015-08-05 12:04:24 +08:00
}
2017-08-24 17:37:25 +02:00
#define BYTE_MASK(access) \
	((1 & (access) ? 2 : 0) | \
	 (2 & (access) ? 4 : 0) | \
	 (3 & (access) ? 8 : 0) | \
	 (4 & (access) ? 16 : 0) | \
	 (5 & (access) ? 32 : 0) | \
	 (6 & (access) ? 64 : 0) | \
	 (7 & (access) ? 128 : 0))
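BYTE_MASK() can be read as follows: bit i of the result is set when the 3-bit access combination i contains the given access bit, so the byte is a truth table over all eight combinations. The sketch below computes the same thing with a loop. The function name and the SKETCH_ACC_* values (exec = 1, write = 2, user = 4) are assumptions made for illustration and are not taken from this section.

```c
#include <stdint.h>

/* Assumed access-bit values, for illustration only. */
#define SKETCH_ACC_EXEC  1u
#define SKETCH_ACC_WRITE 2u
#define SKETCH_ACC_USER  4u

/*
 * Loop form of BYTE_MASK(): set bit i for every 3-bit combination i
 * (0..7) that includes the requested access bit.
 */
uint8_t byte_mask_sketch(unsigned access)
{
	uint8_t mask = 0;
	unsigned i;

	for (i = 0; i < 8; i++)
		if (i & access)
			mask |= (uint8_t)(1u << i);
	return mask;
}
```

Under these assumed values the three masks are the familiar truth-table constants 0xAA, 0xCC and 0xF0, which can then be combined with plain bitwise operators to evaluate a permission rule for all eight access combinations at once.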
2021-06-22 10:57:17 -07:00
static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.
Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).
The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.
The result is short, branch-free code.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-12 14:52:00 +03:00
{
2017-08-24 17:37:25 +02:00
	unsigned byte;
	const u8 x = BYTE_MASK(ACC_EXEC_MASK);
	const u8 w = BYTE_MASK(ACC_WRITE_MASK);
	const u8 u = BYTE_MASK(ACC_USER_MASK);
2021-06-22 10:57:17 -07:00
	bool cr4_smep = is_cr4_smep(mmu);
	bool cr4_smap = is_cr4_smap(mmu);
	bool cr0_wp = is_cr0_wp(mmu);
2021-06-22 10:57:22 -07:00
	bool efer_nx = is_efer_nx(mmu);
2012-09-12 14:52:00 +03:00
	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
2017-08-24 17:37:25 +02:00
		unsigned pfec = byte << 1;
2014-04-01 17:46:34 +08:00
		/*
2017-08-24 17:37:25 +02:00
		 * Each "*f" variable has a 1 bit for each UWX value
		 * that causes a fault with the given PFEC.
2014-04-01 17:46:34 +08:00
		 */
2012-09-12 14:52:00 +03:00
2017-08-24 17:37:25 +02:00
		/* Faults from writes to non-writable pages */
2019-07-12 11:12:30 +02:00
		u8 wf = (pfec & PFERR_WRITE_MASK) ? (u8)~w : 0;
2017-08-24 17:37:25 +02:00
		/* Faults from user mode accesses to supervisor pages */
2019-07-12 11:12:30 +02:00
		u8 uf = (pfec & PFERR_USER_MASK) ? (u8)~u : 0;
2017-08-24 17:37:25 +02:00
		/* Faults from fetches of non-executable pages */
2019-07-12 11:12:30 +02:00
		u8 ff = (pfec & PFERR_FETCH_MASK) ? (u8)~x : 0;
2017-08-24 17:37:25 +02:00
		/* Faults from kernel mode fetches of user pages */
		u8 smepf = 0;
		/* Faults from kernel mode accesses of user pages */
		u8 smapf = 0;
		if (!ept) {
			/* Faults from kernel mode accesses to user pages */
			u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
			/* Not really needed: !nx will cause pte.nx to fault */
2021-06-22 10:57:22 -07:00
			if (!efer_nx)
2017-08-24 17:37:25 +02:00
				ff = 0;
			/* Allow supervisor writes if !cr0.wp */
			if (!cr0_wp)
				wf = (pfec & PFERR_USER_MASK) ? wf : 0;
			/* Disallow supervisor fetches of user code if cr4.smep */
			if (cr4_smep)
				smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
			/*
			 * SMAP: kernel-mode data accesses from user-mode
			 * mappings should fault. A fault is considered
			 * as a SMAP violation if all of the following
2018-10-04 11:45:00 -04:00
			 * conditions are true:
2017-08-24 17:37:25 +02:00
			 *   - X86_CR4_SMAP is set in CR4
			 *   - A user page is accessed
			 *   - The access is not a fetch
2022-03-11 15:03:44 +08:00
			 *   - The access is supervisor mode
			 *   - If implicit supervisor access or X86_EFLAGS_AC is clear
2017-08-24 17:37:25 +02:00
			 *
2022-03-11 15:03:42 +08:00
			 * Here, we cover the first four conditions.
			 * The fifth is computed dynamically in permission_fault();
2017-08-24 17:37:25 +02:00
			 * PFERR_RSVD_MASK bit will be set in PFEC if the access is
			 * *not* subject to SMAP restrictions.
			 */
			if (cr4_smap)
				smapf = (pfec & (PFERR_RSVD_MASK | PFERR_FETCH_MASK)) ? 0 : kf;
2012-09-12 14:52:00 +03:00
		}
2017-08-24 17:37:25 +02:00
		mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
2012-09-12 14:52:00 +03:00
}
}
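The precomputed table built by update_permission_bitmask() can be modelled in isolation. The sketch below is a simplified, hypothetical version (toy_mmu, toy_update_permissions and toy_permission_fault are illustrative names, not kernel symbols). It omits the SMEP, SMAP and EPT handling, and it assumes the ACC_* values 1, 2 and 4 for execute, write and user, together with the architectural page fault error code bit positions. It shows the two ideas the commit message describes: one byte per error code, and one bit per UWX combination inside that byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PTE access bits (X, W, U). */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u

/* x86 page fault error code bits. */
#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_WRITE_MASK   (1u << 1)
#define PFERR_USER_MASK    (1u << 2)
#define PFERR_RSVD_MASK    (1u << 3)
#define PFERR_FETCH_MASK   (1u << 4)

/* Bit i of the result is set iff UWX combination i contains 'access'. */
static uint8_t byte_mask(unsigned access)
{
	uint8_t mask = 0;
	unsigned uwx;

	for (uwx = 0; uwx < 8; uwx++)
		if (uwx & access)
			mask |= (uint8_t)(1u << uwx);
	return mask;
}

struct toy_mmu {
	uint8_t permissions[16]; /* indexed by pfec >> 1 */
};

/* Rebuild the table; only needed when cr0_wp or efer_nx changes. */
static void toy_update_permissions(struct toy_mmu *mmu, bool cr0_wp,
				   bool efer_nx)
{
	const uint8_t x = byte_mask(ACC_EXEC_MASK);
	const uint8_t w = byte_mask(ACC_WRITE_MASK);
	const uint8_t u = byte_mask(ACC_USER_MASK);
	unsigned byte;

	for (byte = 0; byte < 16; byte++) {
		unsigned pfec = byte << 1;
		/* Each "*f" byte has a 1 bit per UWX value that faults. */
		uint8_t wf = (pfec & PFERR_WRITE_MASK) ? (uint8_t)~w : 0;
		uint8_t uf = (pfec & PFERR_USER_MASK) ? (uint8_t)~u : 0;
		uint8_t ff = (pfec & PFERR_FETCH_MASK) ? (uint8_t)~x : 0;

		if (!efer_nx)
			ff = 0;
		/* Supervisor writes ignore the W bit when CR0.WP is clear. */
		if (!cr0_wp)
			wf = (pfec & PFERR_USER_MASK) ? wf : 0;

		mmu->permissions[byte] = ff | uf | wf;
	}
}

/* Branch-free lookup: select the byte by PFEC, the bit by PTE access. */
static bool toy_permission_fault(const struct toy_mmu *mmu,
				 unsigned pte_access, unsigned pfec)
{
	return (mmu->permissions[(pfec & 0x1fu) >> 1] >> pte_access) & 1;
}
```

For example, with CR0.WP and EFER.NX both set, a user-mode write (PFERR_WRITE_MASK | PFERR_USER_MASK) to a page whose access is ACC_USER_MASK | ACC_EXEC_MASK selects byte 3 and bit 5, which is set, so the access faults. A user-mode read of the same page selects byte 2 and bit 5, which is clear.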
2016-03-22 16:51:19 +08:00
/*
 * PKU is an additional mechanism by which the paging controls access to
 * user-mode addresses based on the value in the PKRU register. Protection
 * key violations are reported through a bit in the page fault error code.
 * Unlike other bits of the error code, the PK bit is not known at the
 * call site of e.g. gva_to_gpa; it must be computed directly in
 * permission_fault based on two bits of PKRU, on some machine state (CR4,
 * CR0, EFER, CPL), and on other bits of the error code and the page tables.
 *
 * In particular the following conditions come from the error code, the
 * page tables and the machine state:
 * - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
 * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
 * - PK is always zero if U=0 in the page tables
 * - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
 *
 * The PKRU bitmask caches the result of these four conditions. The error
 * code (minus the P bit) and the page table's U bit form an index into the
 * PKRU bitmask. Two bits of the PKRU bitmask are then extracted and ANDed
 * with the two bits of the PKRU register corresponding to the protection key.
 * For the first three conditions above the bits will be 00, thus masking
 * away both AD and WD. For all reads or if the last condition holds, WD
 * only will be masked away.
 */
2021-06-22 10:57:18 -07:00
static void update_pkru_bitmask(struct kvm_mmu *mmu)
2016-03-22 16:51:19 +08:00
{
	unsigned bit;
	bool wp;
2021-10-21 15:10:22 +08:00
	mmu->pkru_mask = 0;
	if (!is_cr4_pke(mmu))
2016-03-22 16:51:19 +08:00
		return;
2021-06-22 10:57:18 -07:00
	wp = is_cr0_wp(mmu);
2016-03-22 16:51:19 +08:00
	for (bit = 0; bit < ARRAY_SIZE(mmu->permissions); ++bit) {
		unsigned pfec, pkey_bits;
		bool check_pkey, check_write, ff, uf, wf, pte_user;
		pfec = bit << 1;
		ff = pfec & PFERR_FETCH_MASK;
		uf = pfec & PFERR_USER_MASK;
		wf = pfec & PFERR_WRITE_MASK;
		/* PFEC.RSVD is replaced by ACC_USER_MASK. */
		pte_user = pfec & PFERR_RSVD_MASK;
		/*
		 * Only need to check the access which is not an
		 * instruction fetch and is to a user page.
		 */
		check_pkey = (!ff && pte_user);
		/*
		 * write access is controlled by PKRU if it is a
		 * user access or CR0.WP = 1.
		 */
		check_write = check_pkey && wf && (uf || wp);
		/* PKRU.AD stops both read and write access. */
		pkey_bits = !!check_pkey;
		/* PKRU.WD stops write access. */
		pkey_bits |= (!!check_write) << 1;
		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
	}
}
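The comment above update_pkru_bitmask() describes how the cached mask is consumed: two bits are extracted at an offset formed from the error code and the PTE's U bit, then ANDed with the two PKRU bits of the page's protection key. The sketch below models both halves under stated assumptions. toy_build_pkru_mask and toy_pkey_fault are hypothetical names, the PKRU layout (AD in bit 2*key, WD in bit 2*key+1) follows the x86 architecture, and the lookup is a simplified reading of the comment rather than a copy of permission_fault().

```c
#include <stdbool.h>
#include <stdint.h>

#define PFERR_WRITE_MASK (1u << 1)
#define PFERR_USER_MASK  (1u << 2)
#define PFERR_RSVD_MASK  (1u << 3) /* reused to carry the PTE's U bit */
#define PFERR_FETCH_MASK (1u << 4)

/* Build the 32-bit mask: two bits (AD, WD) for each even PFEC value. */
static uint32_t toy_build_pkru_mask(bool cr4_pke, bool cr0_wp)
{
	uint32_t mask = 0;
	unsigned bit;

	if (!cr4_pke)
		return 0;

	for (bit = 0; bit < 16; bit++) {
		unsigned pfec = bit << 1;
		bool ff = pfec & PFERR_FETCH_MASK;
		bool uf = pfec & PFERR_USER_MASK;
		bool wf = pfec & PFERR_WRITE_MASK;
		bool pte_user = pfec & PFERR_RSVD_MASK;
		/* Keys apply only to data accesses to user pages. */
		bool check_pkey = !ff && pte_user;
		/* WD applies to user writes, or to all writes if CR0.WP=1. */
		bool check_write = check_pkey && wf && (uf || cr0_wp);
		uint32_t pkey_bits = (uint32_t)check_pkey |
				     ((uint32_t)check_write << 1);

		mask |= pkey_bits << pfec;
	}
	return mask;
}

/* Returns true if the protection key 'pkey' blocks the access. */
static bool toy_pkey_fault(uint32_t pkru_mask, uint32_t pkru, unsigned pkey,
			   unsigned pfec, bool pte_user)
{
	/* AD is bit 0 and WD is bit 1 of the extracted pair. */
	uint32_t pkru_bits = (pkru >> (pkey * 2)) & 3;
	/* Keep W, U and F from the error code; slot the U bit into RSVD. */
	unsigned offset = (pfec & (PFERR_WRITE_MASK | PFERR_USER_MASK |
				   PFERR_FETCH_MASK)) |
			  (pte_user ? PFERR_RSVD_MASK : 0);

	pkru_bits &= pkru_mask >> offset;
	return pkru_bits != 0;
}
```

With CR4.PKE and CR0.WP set and only WD set for key 1 (PKRU = 0x8), a user write to a user page faults while a user read does not. With CR0.WP clear, a supervisor write to the same page is allowed because WD is masked away.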
2021-06-22 10:57:28 -07:00
static void reset_guest_paging_metadata(struct kvm_vcpu *vcpu,
					struct kvm_mmu *mmu)
2012-09-12 20:46:56 +03:00
{
2021-06-22 10:57:28 -07:00
	if (!is_cr0_pg(mmu))
		return;
2016-02-23 12:51:19 +01:00
2022-04-19 23:17:02 +12:00
	reset_guest_rsvds_bits_mask(vcpu, mmu);
2021-06-22 10:57:28 -07:00
	update_permission_bitmask(mmu, false);
	update_pkru_bitmask(mmu);
2012-09-12 20:46:56 +03:00
}
2021-06-22 10:57:30 -07:00
static void paging64_init_context(struct kvm_mmu *context)
2006-12-10 02:21:36 -08:00
{
	context->page_fault = paging64_page_fault;
	context->gva_to_gpa = paging64_gva_to_gpa;
2023-02-16 23:41:11 +08:00
	context->sync_spte = paging64_sync_spte;
2006-12-10 02:21:36 -08:00
}
2021-06-22 10:57:21 -07:00
static void paging32_init_context(struct kvm_mmu *context)
2006-12-10 02:21:36 -08:00
{
	context->page_fault = paging32_page_fault;
	context->gva_to_gpa = paging32_gva_to_gpa;
2023-02-16 23:41:11 +08:00
	context->sync_spte = paging32_sync_spte;
2006-12-10 02:21:36 -08:00
}
2023-02-02 18:27:51 +00:00
static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
					    const struct kvm_mmu_role_regs *regs)
2022-02-11 06:50:11 -05:00
{
2022-02-10 07:38:32 -05:00
	union kvm_cpu_role role = {0};
2022-02-11 06:50:11 -05:00
	role.base.access = ACC_ALL;
	role.base.smm = is_smm(vcpu);
	role.base.guest_mode = is_guest_mode(vcpu);
	role.ext.valid = 1;
	if (!____is_cr0_pg(regs)) {
		role.base.direct = 1;
		return role;
	}
	role.base.efer_nx = ____is_efer_nx(regs);
	role.base.cr0_wp = ____is_cr0_wp(regs);
	role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs);
	role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs);
	role.base.has_4_byte_gpte = !____is_cr4_pae(regs);
2022-02-10 07:32:40 -05:00
	if (____is_efer_lma(regs))
		role.base.level = ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL
							: PT64_ROOT_4LEVEL;
	else if (____is_cr4_pae(regs))
		role.base.level = PT32E_ROOT_LEVEL;
	else
		role.base.level = PT32_ROOT_LEVEL;
2022-02-11 06:50:11 -05:00
	role.ext.cr4_smep = ____is_cr4_smep(regs);
	role.ext.cr4_smap = ____is_cr4_smap(regs);
	role.ext.cr4_pse = ____is_cr4_pse(regs);
	/* PKEY and LA57 are active iff long mode is active. */
	role.ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
	role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
	role.ext.efer_lma = ____is_efer_lma(regs);
	return role;
}
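The level selection inside kvm_calc_cpu_role() reduces to a small decision on three guest control bits, reached only when CR0.PG is set. The following sketch isolates it. toy_guest_root_level is a hypothetical helper, and the numeric values 2, 3, 4 and 5 for the level constants are assumed here to equal the number of page-table levels in each paging mode.

```c
#include <stdbool.h>

/* Assumed values: number of paging levels per mode. */
enum {
	PT32_ROOT_LEVEL  = 2, /* legacy 32-bit paging */
	PT32E_ROOT_LEVEL = 3, /* PAE paging */
	PT64_ROOT_4LEVEL = 4, /* 4-level long mode paging */
	PT64_ROOT_5LEVEL = 5, /* 5-level long mode paging (LA57) */
};

/* Guest page-table root level for a guest with paging enabled. */
static int toy_guest_root_level(bool efer_lma, bool cr4_pae, bool cr4_la57)
{
	if (efer_lma)
		return cr4_la57 ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
	if (cr4_pae)
		return PT32E_ROOT_LEVEL;
	return PT32_ROOT_LEVEL;
}
```

CR4.LA57 only matters in long mode, which matches the way the function above masks cr4_la57 with efer_lma in the extended role.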
2023-04-04 17:26:08 -07:00
void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
					struct kvm_mmu *mmu)
{
	const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);
	BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
	BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
	if (is_cr0_wp(mmu) == cr0_wp)
		return;
	mmu->cpu_role.base.cr0_wp = cr0_wp;
	reset_guest_paging_metadata(vcpu, mmu);
}
2020-07-15 20:41:20 -07:00
static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
{
2021-08-18 11:55:47 -05:00
	/* tdp_root_level is architecture forced level, use it if nonzero */
	if (tdp_root_level)
		return tdp_root_level;
2020-07-15 20:41:20 -07:00
	/* Use 5-level TDP if and only if it's useful/necessary. */
2020-07-15 20:41:22 -07:00
	if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
2020-07-15 20:41:20 -07:00
		return 4;
2020-07-15 20:41:22 -07:00
	return max_tdp_level;
2020-07-15 20:41:20 -07:00
}
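The policy in kvm_mmu_get_tdp_level() can be exercised as a pure function by passing the three inputs as parameters instead of reading module-level state. This is a hedged sketch: toy_tdp_level is a hypothetical name, and guest_maxphyaddr stands in for the value cpuid_maxphyaddr(vcpu) would return.

```c
/*
 * Pick the TDP root level:
 *  - a nonzero forced_level wins unconditionally;
 *  - otherwise drop from 5 to 4 levels when the guest's physical
 *    address width fits in 48 bits, since 4 levels already cover it;
 *  - otherwise use the maximum supported level.
 */
static int toy_tdp_level(int forced_level, int max_level, int guest_maxphyaddr)
{
	if (forced_level)
		return forced_level;

	if (max_level == 5 && guest_maxphyaddr <= 48)
		return 4;

	return max_level;
}
```

A host that supports 5-level TDP therefore still uses 4 levels for a guest reporting 48 or fewer physical address bits, and 5 levels only when the guest address width requires it.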
2022-02-14 08:46:24 -05:00
static union kvm_mmu_page_role
2021-06-22 10:57:08 -07:00
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
2022-02-10 07:38:32 -05:00
				union kvm_cpu_role cpu_role)
2018-06-27 14:59:07 -07:00
{
2022-02-14 08:46:24 -05:00
	union kvm_mmu_page_role role = {0};
2018-06-27 14:59:07 -07:00
2022-02-14 08:46:24 -05:00
	role.access = ACC_ALL;
	role.cr0_wp = true;
	role.efer_nx = true;
	role.smm = cpu_role.base.smm;
	role.guest_mode = cpu_role.base.guest_mode;
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault via the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 03:47:44 +00:00
	role.ad_disabled = !kvm_ad_enabled();
2022-02-14 08:46:24 -05:00
	role.level = kvm_mmu_get_tdp_level(vcpu);
	role.direct = true;
	role.has_4_byte_gpte = false;
2018-06-27 14:59:07 -07:00
	return role;
}
2022-02-10 07:30:31 -05:00
static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
2022-02-10 07:57:21 -05:00
			     union kvm_cpu_role cpu_role)
2008-02-07 13:47:44 +01:00
{
2020-07-10 16:11:50 +02:00
	struct kvm_mmu *context = &vcpu->arch.root_mmu;
2022-02-14 08:46:24 -05:00
	union kvm_mmu_page_role root_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
2008-02-07 13:47:44 +01:00
2022-02-11 06:50:11 -05:00
	if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
2022-02-14 08:46:24 -05:00
	    root_role.word == context->root_role.word)
2018-10-08 21:28:12 +02:00
		return;
2022-02-11 06:50:11 -05:00
	context->cpu_role.as_u64 = cpu_role.as_u64;
2022-02-14 08:46:24 -05:00
	context->root_role.word = root_role.word;
2020-02-06 14:14:34 -08:00
	context->page_fault = kvm_tdp_page_fault;
2023-02-16 23:41:11 +08:00
	context->sync_spte = NULL;
2023-03-22 02:37:26 +01:00
	context->get_guest_pgd = get_guest_cr3;
2011-07-28 11:36:17 +03:00
	context->get_pdptr = kvm_pdptr_read;
2010-09-10 17:30:43 +02:00
	context->inject_page_fault = kvm_inject_page_fault;
2008-02-07 13:47:44 +01:00
2021-06-22 10:57:31 -07:00
	if (!is_cr0_pg(context))
2008-02-07 13:47:44 +01:00
		context->gva_to_gpa = nonpaging_gva_to_gpa;
2021-06-22 10:57:31 -07:00
	else if (is_cr4_pae(context))
2012-03-05 16:53:06 +01:00
		context->gva_to_gpa = paging64_gva_to_gpa;
2021-06-22 10:57:29 -07:00
	else
2012-03-05 16:53:06 +01:00
		context->gva_to_gpa = paging32_gva_to_gpa;
2008-02-07 13:47:44 +01:00
2021-06-22 10:57:28 -07:00
	reset_guest_paging_metadata(vcpu, context);
2022-01-25 17:58:53 +08:00
	reset_tdp_shadow_zero_bits_mask(context);
2008-02-07 13:47:44 +01:00
}
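init_kvm_tdp_mmu() skips all reinitialization when the newly computed roles equal the cached ones, comparing each role as a single integer rather than field by field. A minimal sketch of that pattern follows, assuming a single 32-bit role for brevity (the function above compares a 64-bit CPU role and a 32-bit root role). toy_role, toy_mmu and toy_init_context are hypothetical names, and the bitfield layout is illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* A role packed into bitfields, overlaid with one comparable word. */
union toy_role {
	uint32_t word;
	struct {
		unsigned level:4;
		unsigned direct:1;
		unsigned smm:1;
	} bits;
};

struct toy_mmu {
	union toy_role role;
	int reinit_count; /* counts how often the context was rebuilt */
};

/*
 * Rebuild the context only if the role changed. Returns true when a
 * rebuild happened. Callers must zero-initialize roles so that unused
 * bits compare equal.
 */
static bool toy_init_context(struct toy_mmu *mmu, union toy_role role)
{
	if (role.word == mmu->role.word)
		return false;

	mmu->role.word = role.word;
	mmu->reinit_count++; /* stand-in for resetting callbacks and masks */
	return true;
}
```

The whole-word comparison is only sound because every role starts from a zeroed union (as with the `{0}` initializers above), so unused bits never hold stale values.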
2020-07-10 16:11:50 +02:00
static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
2022-02-10 07:38:32 -05:00
				    union kvm_cpu_role cpu_role,
2022-02-14 08:46:24 -05:00
				    union kvm_mmu_page_role root_role)
2018-06-27 14:59:07 -07:00
{
2022-02-11 06:50:11 -05:00
	if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
2022-02-14 08:46:24 -05:00
	    root_role.word == context->root_role.word)
2021-06-22 10:57:13 -07:00
		return;
2008-12-21 19:20:09 +02:00
2022-02-11 06:50:11 -05:00
	context->cpu_role.as_u64 = cpu_role.as_u64;
2022-02-14 08:46:24 -05:00
	context->root_role.word = root_role.word;
2021-06-22 10:57:13 -07:00
2021-06-22 10:57:31 -07:00
	if (!is_cr0_pg(context))
2021-06-22 10:57:21 -07:00
		nonpaging_init_context(context);
2021-06-22 10:57:31 -07:00
	else if (is_cr4_pae(context))
2021-06-22 10:57:30 -07:00
		paging64_init_context(context);
2006-12-10 02:21:36 -08:00
	else
2021-06-22 10:57:21 -07:00
		paging32_init_context(context);
2008-12-21 19:20:09 +02:00
2021-06-22 10:57:28 -07:00
	reset_guest_paging_metadata(vcpu, context);
2015-08-05 12:04:24 +08:00
	reset_shadow_zero_bits_mask(vcpu, context);
2010-09-10 17:30:44 +02:00
}
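The early return at the top of shadow_mmu_init_context() acts as a cache check: the context is rebuilt only when the packed role value differs from the one already stored. A minimal sketch of that pattern follows, assuming a simplified two-field role. The names demo_role, demo_mmu and the demo_* helpers are hypothetical and exist only for illustration; they are not KVM's types or functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a packed role union. */
union demo_role {
	uint64_t as_u64;          /* whole role, compared in one operation */
	struct {
		uint32_t level;   /* paging depth the role describes */
		uint32_t valid;   /* 1 once the role has been computed */
	};
};

struct demo_mmu {
	union demo_role role;
	unsigned int init_count;  /* how many times setup really ran */
};

/* Build a role for the given paging depth; computed roles are always valid. */
static union demo_role demo_calc_role(uint32_t level)
{
	union demo_role role = { .as_u64 = 0 };

	role.level = level;
	role.valid = 1;
	return role;
}

/* Returns true if the context was (re)initialized, false if it was reused. */
static bool demo_init_context(struct demo_mmu *mmu, union demo_role new_role)
{
	if (new_role.as_u64 == mmu->role.as_u64)
		return false;

	mmu->role.as_u64 = new_role.as_u64;
	mmu->init_count++;
	return true;
}

/* Clearing the valid bit guarantees a mismatch with any computed role. */
static void demo_invalidate(struct demo_mmu *mmu)
{
	mmu->role.valid = 0;
}
```

Because every computed role has valid set, clearing that bit in the stored copy is enough to force the next initialization to run in full, which appears to be the same idea kvm_mmu_after_set_cpuid() relies on when it clears cpu_role.ext.valid.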
2020-07-10 16:11:49 +02:00
2021-06-22 10:57:05 -07:00
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
2022-02-10 07:57:21 -05:00
				union kvm_cpu_role cpu_role)
2020-07-10 16:11:49 +02:00
{
2020-07-10 16:11:50 +02:00
	struct kvm_mmu *context = &vcpu->arch.root_mmu;
2022-02-10 07:39:50 -05:00
	union kvm_mmu_page_role root_role;
2020-07-10 16:11:49 +02:00
2022-02-10 07:39:50 -05:00
	root_role = cpu_role.base;
2020-07-10 16:11:49 +02:00
2022-02-10 07:39:50 -05:00
	/* KVM uses PAE paging whenever the guest isn't using 64-bit paging. */
	root_role.level = max_t(u32, root_role.level, PT32E_ROOT_LEVEL);
2020-07-15 20:41:15 -07:00
2022-02-10 07:39:50 -05:00
	/*
	 * KVM forces EFER.NX=1 when TDP is disabled, reflect it in the MMU role.
	 * KVM uses NX when TDP is disabled to handle a variety of scenarios,
	 * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
	 * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
	 * The iTLB multi-hit workaround can be toggled at any time, so assume
	 * NX can be used by any non-nested shadow MMU to avoid having to reset
	 * MMU contexts.
	 */
	root_role.efer_nx = true;
	shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
2020-07-15 20:41:15 -07:00
}
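The max_t() call in kvm_init_shadow_mmu() clamps the shadow root level from below, so the shadow page tables never use fewer than three levels (PAE) even when the guest uses 2-level paging. A small sketch of that clamp follows; the DEMO_* constants mirror the numeric values of the kernel's PT32_ROOT_LEVEL, PT32E_ROOT_LEVEL, PT64_ROOT_4LEVEL and PT64_ROOT_5LEVEL, and demo_shadow_root_level() is a hypothetical helper, not a KVM function.

```c
#include <stdint.h>

/* Local copies of the paging depths; the names are illustrative only. */
#define DEMO_PT32_ROOT_LEVEL   2u  /* legacy 32-bit paging */
#define DEMO_PT32E_ROOT_LEVEL  3u  /* PAE paging */
#define DEMO_PT64_ROOT_4LEVEL  4u  /* 4-level long mode paging */
#define DEMO_PT64_ROOT_5LEVEL  5u  /* 5-level long mode paging */

/* The shadow root is at least PAE depth, otherwise it follows the guest. */
static uint32_t demo_shadow_root_level(uint32_t guest_level)
{
	return guest_level > DEMO_PT32E_ROOT_LEVEL ? guest_level
						   : DEMO_PT32E_ROOT_LEVEL;
}
```

With this clamp, a guest level of 2 maps to a shadow level of 3, while guest levels of 4 and 5 pass through unchanged.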
2021-06-22 10:56:59 -07:00
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
			     unsigned long cr4, u64 efer, gpa_t nested_cr3)
2020-07-10 16:11:49 +02:00
{
2020-07-10 16:11:50 +02:00
	struct kvm_mmu *context = &vcpu->arch.guest_mmu;
2021-06-22 10:57:05 -07:00
	struct kvm_mmu_role_regs regs = {
		.cr0 = cr0,
2021-11-22 13:01:37 -05:00
		.cr4 = cr4 & ~X86_CR4_PKE,
2021-06-22 10:57:05 -07:00
		.efer = efer,
	};
2022-02-10 07:38:32 -05:00
	union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
2022-02-10 07:39:50 -05:00
	union kvm_mmu_page_role root_role;
	/* NPT requires CR0.PG=1. */
	WARN_ON_ONCE(cpu_role.base.direct);
	root_role = cpu_role.base;
	root_role.level = kvm_mmu_get_tdp_level(vcpu);
2022-04-20 21:12:04 +08:00
	if (root_role.level == PT64_ROOT_5LEVEL &&
	    cpu_role.base.level == PT64_ROOT_4LEVEL)
		root_role.passthrough = 1;
2020-07-10 16:11:55 +02:00
2022-02-14 08:46:24 -05:00
	shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
2021-11-22 13:18:23 -05:00
	kvm_mmu_new_pgd(vcpu, nested_cr3);
2020-07-10 16:11:49 +02:00
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
2010-09-10 17:30:44 +02:00
2022-02-10 07:38:32 -05:00
static union kvm_cpu_role
2018-10-08 21:28:11 +02:00
kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
2020-03-02 18:02:36 -08:00
				   bool execonly, u8 level)
2018-06-27 14:59:07 -07:00
{
2022-02-10 07:38:32 -05:00
	union kvm_cpu_role role = {0};
2018-10-08 21:28:08 +02:00
2022-02-10 07:30:08 -05:00
	/*
	 * KVM does not support SMM transfer monitors, and consequently does not
	 * support the "entry to SMM" control either. role.base.smm is always 0.
	 */
	WARN_ON_ONCE(is_smm(vcpu));
2020-03-02 18:02:36 -08:00
	role.base.level = level;
2021-11-24 20:20:51 +08:00
	role.base.has_4_byte_gpte = false;
2018-10-08 21:28:11 +02:00
	role.base.direct = false;
	role.base.ad_disabled = !accessed_dirty;
	role.base.guest_mode = true;
	role.base.access = ACC_ALL;
2018-06-27 14:59:07 -07:00
2021-06-22 10:57:07 -07:00
	role.ext.word = 0;
2018-10-08 21:28:11 +02:00
	role.ext.execonly = execonly;
2021-06-22 10:57:07 -07:00
	role.ext.valid = 1;
2018-06-27 14:59:07 -07:00
	return role;
}
2017-03-30 11:55:30 +02:00
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
2021-11-24 20:20:49 +08:00
			     int huge_page_level, bool accessed_dirty,
			     gpa_t new_eptp)
2013-08-05 11:07:16 +03:00
{
2020-07-10 16:11:50 +02:00
	struct kvm_mmu *context = &vcpu->arch.guest_mmu;
2020-03-02 18:02:36 -08:00
	u8 level = vmx_eptp_page_walk_level(new_eptp);
2022-02-10 07:38:32 -05:00
	union kvm_cpu_role new_mode =
2018-10-08 21:28:11 +02:00
		kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
2020-03-02 18:02:36 -08:00
						   execonly, level);
2018-10-08 21:28:11 +02:00
2022-02-11 06:50:11 -05:00
	if (new_mode.as_u64 != context->cpu_role.as_u64) {
		/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
		context->cpu_role.as_u64 = new_mode.as_u64;
2022-02-14 08:46:24 -05:00
		context->root_role.word = new_mode.base.word;
2022-02-04 04:12:31 -05:00
		context->page_fault = ept_page_fault;
		context->gva_to_gpa = ept_gva_to_gpa;
2023-02-16 23:41:11 +08:00
		context->sync_spte = ept_sync_spte;
2022-02-10 08:00:56 -05:00
2022-02-04 04:12:31 -05:00
		update_permission_bitmask(context, true);
		context->pkru_mask = 0;
		reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
		reset_ept_shadow_zero_bits_mask(context, execonly);
	}
2018-10-08 21:28:06 +02:00
2021-11-22 13:18:23 -05:00
	kvm_mmu_new_pgd(vcpu, new_eptp);
2013-08-05 11:07:16 +03:00
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
2022-02-10 07:30:31 -05:00
static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
2022-02-10 07:57:21 -05:00
			     union kvm_cpu_role cpu_role)
2010-09-10 17:30:44 +02:00
{
2020-07-10 16:11:50 +02:00
	struct kvm_mmu *context = &vcpu->arch.root_mmu;
2013-10-02 16:56:14 +02:00
2022-02-10 07:57:21 -05:00
	kvm_init_shadow_mmu(vcpu, cpu_role);
2020-05-19 06:18:31 -04:00
2023-03-22 02:37:26 +01:00
	context->get_guest_pgd = get_guest_cr3;
2013-10-02 16:56:14 +02:00
	context->get_pdptr = kvm_pdptr_read;
	context->inject_page_fault = kvm_inject_page_fault;
2006-12-10 02:21:36 -08:00
}
2022-02-10 07:30:31 -05:00
static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
2022-02-10 07:57:21 -05:00
				union kvm_cpu_role new_mode)
2010-09-10 17:30:54 +02:00
{
	struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
2022-02-11 06:50:11 -05:00
	if (new_mode.as_u64 == g_context->cpu_role.as_u64)
2018-10-08 21:28:13 +02:00
		return;
2022-02-11 06:50:11 -05:00
	g_context->cpu_role.as_u64 = new_mode.as_u64;
2023-03-22 02:37:26 +01:00
	g_context->get_guest_pgd = get_guest_cr3;
2011-07-28 11:36:17 +03:00
	g_context->get_pdptr = kvm_pdptr_read;
2010-09-10 17:30:54 +02:00
	g_context->inject_page_fault = kvm_inject_page_fault;
2020-03-23 20:42:57 -04:00
	/*
	 * L2 page tables are never shadowed, so there is no need to sync
	 * SPTEs.
	 */
KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead.
In hardware TLB, invalidating TLB entries means the translations are
removed from the TLB.
In KVM shadowed vTLB, the translations (combinations of shadow paging
and hardware TLB) are generally maintained as long as they remain "clean"
when the TLB of an address space (i.e. a PCID or all) is flushed with
the help of write-protections, sp->unsync, and kvm_sync_page(), where
"clean" in this context means that no updates to KVM's SPTEs are needed.
However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is
unsync, and thus triggers a remote flush even if the original vTLB entry
is clean, i.e. is usable as-is.
Besides this, FNAME(invlpg) is largely a duplicate implementation of
FNAME(sync_spte) to invalidate a vTLB entry.
To address both issues, reuse FNAME(sync_spte) to share the code and
slightly modify the semantics, i.e. keep the vTLB entry if it's "clean"
and avoid remote TLB flush.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-17 07:53:19 +08:00
	g_context->sync_spte = NULL;
2020-03-23 20:42:57 -04:00
2010-09-10 17:30:54 +02:00
	/*
2018-10-08 21:28:05 +02:00
	 * Note that arch.mmu->gva_to_gpa translates l2_gpa to l1_gpa using
2015-12-30 08:26:17 -08:00
	 * L1's nested page tables (e.g. EPT12). The nested translation
	 * of l2_gva to l1_gpa is done by arch.nested_mmu.gva_to_gpa using
	 * L2's page tables as the first level of translation and L1's
	 * nested page tables as the second level of translation. Basically
	 * the gva_to_gpa functions between mmu and nested_mmu are swapped.
2010-09-10 17:30:54 +02:00
	 */
2021-06-22 10:57:26 -07:00
	if (!is_paging(vcpu))
2021-11-24 20:20:44 +08:00
		g_context->gva_to_gpa = nonpaging_gva_to_gpa;
2021-06-22 10:57:26 -07:00
	else if (is_long_mode(vcpu))
2021-11-24 20:20:44 +08:00
		g_context->gva_to_gpa = paging64_gva_to_gpa;
2021-06-22 10:57:26 -07:00
	else if (is_pae(vcpu))
2021-11-24 20:20:44 +08:00
		g_context->gva_to_gpa = paging64_gva_to_gpa;
2021-06-22 10:57:26 -07:00
	else
2021-11-24 20:20:44 +08:00
		g_context->gva_to_gpa = paging32_gva_to_gpa;
2010-09-10 17:30:54 +02:00
2021-06-22 10:57:28 -07:00
	reset_guest_paging_metadata(vcpu, g_context);
2010-09-10 17:30:54 +02:00
}
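The if/else chain at the end of init_kvm_nested_mmu() picks a guest page-table walker from the vCPU's paging mode; long mode and PAE both end up on the 64-bit walker because both use 8-byte page-table entries. The sketch below models only that decision. The enum and demo_select_walker() are hypothetical names for illustration, and the three booleans stand in for is_paging(), is_long_mode() and is_pae().

```c
#include <stdbool.h>

/* Illustrative labels for the three walker families used above. */
enum demo_walker {
	DEMO_WALKER_NONPAGING,  /* paging disabled: GVA == GPA */
	DEMO_WALKER_PAGING32,   /* 4-byte entries, 2-level paging */
	DEMO_WALKER_PAGING64,   /* 8-byte entries: PAE and long mode */
};

/* Mirrors the order of the checks: paging first, then entry width. */
static enum demo_walker demo_select_walker(bool paging, bool long_mode, bool pae)
{
	if (!paging)
		return DEMO_WALKER_NONPAGING;
	if (long_mode || pae)
		return DEMO_WALKER_PAGING64;
	return DEMO_WALKER_PAGING32;
}
```

Folding the long-mode and PAE cases into one condition is a simplification of the sketch; the original keeps them as separate branches that assign the same function.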
2021-06-09 16:42:33 -07:00
void kvm_init_mmu(struct kvm_vcpu *vcpu)
2008-02-07 13:47:44 +01:00
{
2022-02-10 07:30:31 -05:00
	struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
2022-02-10 07:57:21 -05:00
	union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
2022-02-10 07:30:31 -05:00
2010-09-10 17:30:54 +02:00
	if (mmu_is_nested(vcpu))
2022-02-10 07:57:21 -05:00
		init_kvm_nested_mmu(vcpu, cpu_role);
2010-09-10 17:30:54 +02:00
	else if (tdp_enabled)
2022-02-10 07:57:21 -05:00
		init_kvm_tdp_mmu(vcpu, cpu_role);
2008-02-07 13:47:44 +01:00
	else
2022-02-10 07:57:21 -05:00
		init_kvm_softmmu(vcpu, cpu_role);
2008-02-07 13:47:44 +01:00
}
2018-06-27 14:59:10 -07:00
EXPORT_SYMBOL_GPL(kvm_init_mmu);
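kvm_init_mmu() computes the CPU role once and then dispatches on two conditions, checked in a fixed order: nested operation wins over TDP, and the software (shadow) MMU is the fallback. A minimal sketch of that priority order follows; demo_mmu_kind and demo_pick_mmu() are hypothetical names, and the two booleans stand in for mmu_is_nested() and the tdp_enabled flag.

```c
#include <stdbool.h>

/* Illustrative labels for the three initialization paths. */
enum demo_mmu_kind {
	DEMO_MMU_NESTED,  /* vCPU is running a nested guest */
	DEMO_MMU_TDP,     /* hardware two-dimensional paging */
	DEMO_MMU_SOFT,    /* shadow paging fallback */
};

/* Nested is checked first, so it takes priority even when TDP is enabled. */
static enum demo_mmu_kind demo_pick_mmu(bool nested, bool tdp_enabled)
{
	if (nested)
		return DEMO_MMU_NESTED;
	if (tdp_enabled)
		return DEMO_MMU_TDP;
	return DEMO_MMU_SOFT;
}
```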
2008-02-07 13:47:44 +01:00
KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified
Invalidate all MMUs' roles after a CPUID update to force reinitialization
of the MMU context/helpers. Despite the efforts of commit de3ccd26fafc
("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
there are still a handful of CPUID-based properties that affect MMU
behavior but are not incorporated into mmu_role. E.g. 1gb hugepage
support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
factor into the guest's reserved PTE bits.
The obvious alternative would be to add all such properties to mmu_role,
but doing so provides no benefit over simply forcing a reinitialization
on every CPUID update, as setting guest CPUID is a rare operation.
Note, reinitializing all MMUs after a CPUID update does not fix all of
KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID
properties, which means that a vCPU can reuse shadow pages that should
not exist for the new vCPU model, e.g. that map GPAs that are now illegal
(due to MAXPHYADDR changes) or that set bits that are now reserved
(PAGE_SIZE for 1gb pages), etc...
Tracking the relevant CPUID properties in kvm_mmu_page_role would address
the majority of problems, but fully tracking that much state in the
shadow page role comes with an unpalatable cost as it would require a
non-trivial increase in KVM's memory footprint. The GBPAGES case is even
worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
support in the hardware page walker, i.e. it's a virtualization hole that
can't be closed when using TDP.
In other words, resetting the MMU after a CPUID update is largely a
superficial fix. But, it will allow reverting the tracking of MAXPHYADDR
in the mmu_role, and that case in particular needs to mostly work because
KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
is supported. For cases where KVM botches guest behavior, the damage is
limited to that guest. But for the shadow_root_level, a misconfigured
MMU can cause KVM to incorrectly access memory, e.g. due to walking off
the end of its shadow page tables.
Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-22 10:56:51 -07:00
void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
	/*
	 * Invalidate all MMU roles to force them to reinitialize as CPUID
	 * information is factored into reserved bit calculations.
2021-11-22 18:58:18 +01:00
	 *
	 * Correctly handling multiple vCPU models (with respect to paging and
	 * physical address properties) in a single VM would require tracking
	 * all relevant CPUID information in kvm_mmu_page_role. That is very
	 * undesirable as it would increase the memory requirements for
KVM: x86/mmu: Drop infrastructure for multiple page-track modes
Drop "support" for multiple page-track modes, as there is no evidence
that array-based and refcounted metadata is the optimal solution for
other modes, nor is there any evidence that other use cases, e.g. for
access-tracking, will be a good fit for the page-track machinery in
general.
E.g. one potential use case of access-tracking would be to prevent guest
access to poisoned memory (from the guest's perspective). In that case,
the number of poisoned pages is likely to be a very small percentage of
the guest memory, and there is no need to reference count the number of
access-tracking users, i.e. expanding gfn_track[] for a new mode would be
grossly inefficient. And for poisoned memory, host userspace would also
likely want to trap accesses, e.g. to inject #MC into the guest, and that
isn't currently supported by the page-track framework.
A better alternative for that poisoned page use case is likely a
variation of the proposed per-gfn attributes overlay (linked), which
would allow efficiently tracking the sparse set of poisoned pages, and by
default would exit to userspace on access.
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Ben Gardon <bgardon@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 18:35:29 -07:00
	 * gfn_write_track (see struct kvm_mmu_page_role comments). For now
	 * that problem is swept under the rug; KVM's CPUID API is horrific and
2021-11-22 18:58:18 +01:00
	 * it's all but impossible to solve it without introducing a new API.
2021-06-22 10:56:51 -07:00
	 */
2024-04-08 16:11:15 -07:00
	vcpu->arch.root_mmu.root_role.invalid = 1;
	vcpu->arch.guest_mmu.root_role.invalid = 1;
	vcpu->arch.nested_mmu.root_role.invalid = 1;
2022-02-11 06:50:11 -05:00
	vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
	vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
	vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
2021-06-22 10:56:51 -07:00
	kvm_mmu_reset_context(vcpu);
KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability. Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.
KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held). And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.
To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role. Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).
Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.
In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-22 10:56:52 -07:00
	/*
2021-11-22 18:58:18 +01:00
	 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
	 * kvm_arch_vcpu_ioctl().
2021-06-22 10:56:52 -07:00
	 */
2023-03-10 16:45:59 -08:00
	KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm);
2021-06-22 10:56:51 -07:00
}
2013-10-02 16:56:13 +02:00
void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
2006-12-10 02:21:36 -08:00
{
2013-10-02 16:56:12 +02:00
	kvm_mmu_unload(vcpu);
2021-06-09 16:42:33 -07:00
	kvm_init_mmu(vcpu);
2007-06-04 15:58:30 +03:00
}
2007-10-10 14:26:45 +08:00
EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
2007-06-04 15:58:30 +03:00
int kvm_mmu_load(struct kvm_vcpu *vcpu)
2006-12-10 02:21:36 -08:00
{
2007-01-05 16:36:53 -08:00
	int r;
2022-02-10 08:00:56 -05:00
	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
2007-06-04 15:58:30 +03:00
	if (r)
		goto out;
2021-03-04 17:10:49 -08:00
	r = mmu_alloc_special_roots(vcpu);
2007-06-04 15:58:30 +03:00
	if (r)
		goto out;
2022-02-10 08:00:56 -05:00
	if (vcpu->arch.mmu->root_role.direct)
2021-03-04 17:10:50 -08:00
		r = mmu_alloc_direct_roots(vcpu);
	else
		r = mmu_alloc_shadow_roots(vcpu);
2009-05-12 18:55:45 -03:00
	if (r)
		goto out;
2021-03-04 17:11:00 -08:00
	kvm_mmu_sync_roots(vcpu);
2020-03-05 03:52:50 -05:00
	kvm_mmu_load_pgd(vcpu);
2022-02-26 00:15:22 +00:00
	/*
	 * Flush any TLB entries for the new root, the provenance of the root
	 * is unknown. Even if KVM ensures there are no stale TLB entries
	 * for a freed root, in theory another hypervisor could have left
	 * stale entries. Flushing on alloc also allows KVM to skip the TLB
	 * flush when freeing a root (see kvm_tdp_mmu_put_root()).
	 */
KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names
Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.
In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name. Justification for using the VMX nomenclature:
- set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
event that has already been "set" in e.g. the vIRR. SVM's relevant
VMCB field is even named event_inj, and KVM's stat is irq_injections.
- prepare_guest_switch => prepare_switch_to_guest because the former is
ambiguous, e.g. it could mean switching between multiple guests,
switching from the guest to host, etc...
- update_pi_irte => pi_update_irte to match the rest
of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
- start_assignment => pi_start_assignment to again follow VMX's posted
interrupt naming scheme, and to provide context for what bit of code
might care about an otherwise undescribed "assignment".
The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.
VMX and SVM function names are intentionally left as is to minimize the
diff. Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-28 00:51:50 +00:00
	static_call(kvm_x86_flush_tlb_current)(vcpu);
2007-01-05 16:36:53 -08:00
out:
	return r;
2006-12-10 02:21:36 -08:00
}
2007-06-04 15:58:30 +03:00
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
2022-02-21 09:31:51 -05:00
	struct kvm *kvm = vcpu->kvm;
	kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(VALID_PAGE(vcpu->arch.root_mmu.root.hpa));
2022-02-21 09:31:51 -05:00
	kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa));
2022-02-14 09:13:48 -05:00
	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
2007-06-04 15:58:30 +03:00
}
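The once-only behaviour that the WARN_ON_ONCE() conversion relies on can be illustrated with a small userspace sketch. This is not the kernel macro: DEMO_WARN_ON_ONCE, demo_warn_count and demo_run_exits are hypothetical names made up for this example, and the macro assumes the GNU C statement-expression extension (GCC or Clang). It only shows the idea that the condition is still evaluated and returned on every call, while the log message is emitted at most once per call site.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so a caller can observe how many splats were logged. */
static int demo_warn_count;

/*
 * Illustrative stand-in, NOT the kernel's WARN_ON_ONCE().
 * Evaluates to the truth value of cond on every call, but logs at most
 * once per call site thanks to the block-scoped static flag.
 * Assumes GNU C statement expressions.
 */
#define DEMO_WARN_ON_ONCE(cond) ({                              \
	static bool warned_;                                    \
	bool cond_ = !!(cond);                                  \
	if (cond_ && !warned_) {                                \
		warned_ = true;                                 \
		demo_warn_count++;                              \
		fprintf(stderr, "WARNING: %s at %s:%d\n",       \
			#cond, __FILE__, __LINE__);             \
	}                                                       \
	cond_;                                                  \
})

/* Simulates an assertion that fails on every one of n "VM-Exits". */
static int demo_run_exits(int n)
{
	int failures = 0;

	for (int i = 0; i < n; i++) {
		if (DEMO_WARN_ON_ONCE(i >= 0))
			failures++;
	}
	return failures;
}
```

Calling demo_run_exits(1000) reports 1000 failed checks to the caller but writes a single line to stderr, which is the trade-off the commit message argues for: the caller can still react to every failure, while the log is not flooded.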
2006-12-10 02:21:36 -08:00
2022-02-25 18:22:45 +00:00
static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
{
	struct kvm_mmu_page *sp;
	if (!VALID_PAGE(root_hpa))
		return false;
	/*
	 * When freeing obsolete roots, treat roots as obsolete if they don't
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).
Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly the same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.
Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root. Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).
If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeed become obsolete. The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:52:00 -07:00
	 * have an associated shadow page, as it's impossible to determine if
	 * such roots are fresh or stale. This does mean KVM will get false
2022-02-25 18:22:45 +00:00
	 * positives and free roots that don't strictly need to be freed, but
	 * such false positives are relatively rare:
	 *
2023-07-28 17:52:00 -07:00
	 *  (a) only PAE paging and nested NPT have roots without shadow pages
	 *      (or any shadow paging flavor with a dummy root, see note below)
2022-02-25 18:22:45 +00:00
	 *  (b) remote reloads due to a memslot update obsoletes _all_ roots
	 *  (c) KVM doesn't track previous roots for PAE paging, and the guest
	 *      is unlikely to zap an in-use PGD.
2023-07-28 17:52:00 -07:00
*
	 * Note! Dummy roots are unique in that they are obsoleted by memslot
	 * _creation_! See also FNAME(fetch).
2022-02-25 18:22:45 +00:00
	 */
2023-07-28 17:51:56 -07:00
	sp = root_to_sp(root_hpa);
2022-02-25 18:22:45 +00:00
	return !sp || is_obsolete_sp(kvm, sp);
}
static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
{
	unsigned long roots_to_free = 0;
	int i;
	if (is_obsolete_root(kvm, mmu->root.hpa))
		roots_to_free |= KVM_MMU_ROOT_CURRENT;
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
2022-06-06 18:59:05 -06:00
		if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
2022-02-25 18:22:45 +00:00
			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
	}
	if (roots_to_free)
		kvm_mmu_free_roots(kvm, mmu, roots_to_free);
}
void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
{
	__kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
	__kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
}
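The mask-building loop in __kvm_mmu_free_obsolete_roots() can be sketched in isolation as follows. Everything prefixed with demo_ or DEMO_ is hypothetical: the bit layout (bit 0 for the current root, bit 1 + i for previous root i), the count of three previous roots, and the all-ones invalid address are assumptions chosen for illustration, and the obsolete flag is a simplified stand-in for the real shadow-page lookup (in particular, it does not model roots that lack a shadow page).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; consult the kernel headers. */
#define DEMO_NUM_PREV_ROOTS   3
#define DEMO_ROOT_CURRENT     (1UL << 0)
#define DEMO_ROOT_PREVIOUS(i) (1UL << (1 + (i)))
#define DEMO_INVALID_HPA      (~(uint64_t)0)

struct demo_root {
	uint64_t hpa;
	bool obsolete;	/* stand-in for the shadow-page generation check */
};

struct demo_mmu {
	struct demo_root root;
	struct demo_root prev_roots[DEMO_NUM_PREV_ROOTS];
};

/* Simplified stand-in for is_obsolete_root(): invalid roots are skipped. */
static bool demo_is_obsolete_root(const struct demo_root *r)
{
	if (r->hpa == DEMO_INVALID_HPA)
		return false;
	return r->obsolete;
}

/* Collects one bit per obsolete root so they can be freed in one call. */
static unsigned long demo_obsolete_mask(const struct demo_mmu *mmu)
{
	unsigned long roots_to_free = 0;
	int i;

	if (demo_is_obsolete_root(&mmu->root))
		roots_to_free |= DEMO_ROOT_CURRENT;
	for (i = 0; i < DEMO_NUM_PREV_ROOTS; i++) {
		if (demo_is_obsolete_root(&mmu->prev_roots[i]))
			roots_to_free |= DEMO_ROOT_PREVIOUS(i);
	}
	return roots_to_free;
}
```

Accumulating a mask first and freeing once at the end means the free routine is invoked at most one time per MMU, and not at all when the mask is zero, which mirrors the `if (roots_to_free)` guard in the code above.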
2011-09-22 16:57:23 +08:00
static u64 mmu_pte_write_fetch_gpte ( struct kvm_vcpu * vcpu , gpa_t * gpa ,
2018-10-31 14:53:57 -07:00
int * bytes )
2007-01-05 16:36:44 -08:00
{
2018-10-31 14:53:57 -07:00
u64 gentry = 0 ;
2011-09-22 16:57:23 +08:00
int r ;
2010-03-15 13:59:53 +02:00
/*
* Assume that the pte write on a page table of the same type
2011-03-04 19:00:00 +08:00
* as the current vcpu paging mode since we update the sptes only
* when they have the same mode .
2010-03-15 13:59:53 +02:00
*/
2011-09-22 16:57:23 +08:00
if ( is_pae ( vcpu ) & & * bytes = = 4 ) {
2010-03-15 13:59:53 +02:00
/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
2011-09-22 16:57:23 +08:00
* gpa & = ~ ( gpa_t ) 7 ;
* bytes = 8 ;
2010-03-15 13:59:57 +02:00
}
2018-10-31 14:53:57 -07:00
if ( * bytes = = 4 | | * bytes = = 8 ) {
r = kvm_vcpu_read_guest_atomic ( vcpu , * gpa , & gentry , * bytes ) ;
if ( r )
gentry = 0 ;
2010-03-15 13:59:53 +02:00
}
2011-09-22 16:57:23 +08:00
return gentry ;
}
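The PAE fix-up in mmu_pte_write_fetch_gpte() can be tried out in isolation. The following is a minimal userspace sketch, not kernel code: `sketch_widen_pae_write` is a hypothetical name, and `gpa_t` is assumed to be a plain 64-bit integer here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;	/* assumption: userspace stand-in for the kernel type */

/*
 * Hypothetical helper mirroring the PAE fix-up: a 4-byte write that hits
 * half of an 8-byte gpte is widened so that the whole entry is re-read.
 */
static inline void sketch_widen_pae_write(bool pae, gpa_t *gpa, int *bytes)
{
	assert(gpa && bytes);

	if (pae && *bytes == 4) {
		*gpa &= ~(gpa_t)7;	/* round down to the start of the 8-byte entry */
		*bytes = 8;
	}
}
```

With PAE paging, a 4-byte write at 0x1004 is therefore treated as an 8-byte access at 0x1000; without PAE the address and size are left untouched.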
/*
 * If we're seeing too many writes to a page, it may no longer be a page table,
 * or we may be forking, in which case it is better to unmap the page.
*/
2011-12-16 18:18:10 +08:00
static bool detect_write_flooding ( struct kvm_mmu_page * sp )
2011-09-22 16:57:23 +08:00
{
KVM: MMU: improve write flooding detected
Write-flooding detection does not work well. When handling a write to a page,
we treat the page as write-flooded if the last speculative spte has not been
accessed. However, sptes can be created speculatively on many paths, such as
pte prefetch and page sync, so the last speculative spte may not point to the
written page, and the written page may be accessed via other sptes. Relying on
the Accessed bit of the last speculative spte is therefore not enough.
Instead of detecting whether the page was accessed, detect whether the spte is
accessed after it is written. If the spte is written frequently but not
accessed, treat the page as no longer being a page table, or as one that has
not been used for a long time.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-22 16:58:36 +08:00
/*
 * Skip write-flooding detection for an sp whose level is 1, because
 * it can become unsync, in which case the guest page is not write-protected.
*/
2020-04-27 17:54:22 -07:00
if ( sp - > role . level = = PG_LEVEL_4K )
2011-09-22 16:58:36 +08:00
return false ;
2010-04-16 16:35:54 +08:00
2016-02-24 17:51:12 +08:00
atomic_inc ( & sp - > write_flooding_count ) ;
return atomic_read ( & sp - > write_flooding_count ) > = 3 ;
2011-09-22 16:57:23 +08:00
}
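The counting rule in detect_write_flooding() can be illustrated with a small userspace sketch. It assumes C11 atomics in place of the kernel's atomic_t, uses hypothetical `sketch_` names, and folds the kernel's separate increment and read into a single fetch-and-add, so it is an approximation rather than the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_FLOOD_THRESHOLD	3	/* the literal 3 used by the function above */
#define SKETCH_LEVEL_4K		1	/* leaf level, skipped by the detector */

/* Hypothetical stand-in for the two kvm_mmu_page fields the detector reads. */
struct sketch_sp {
	int level;
	atomic_int write_flooding_count;
};

static inline bool sketch_detect_write_flooding(struct sketch_sp *sp)
{
	assert(sp);

	/* Leaf pages can go unsync instead, so they are never counted. */
	if (sp->level == SKETCH_LEVEL_4K)
		return false;

	/* atomic_fetch_add() returns the old value; +1 gives the new count. */
	return atomic_fetch_add(&sp->write_flooding_count, 1) + 1 >=
	       SKETCH_FLOOD_THRESHOLD;
}
```

Under these assumptions the third tracked write to a non-leaf shadow page reports flooding, while a leaf page never does.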
/*
 * Misaligned accesses are too much trouble to fix up; also, they usually
 * indicate a page is not used as a page table.
 */
static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
				    int bytes)
{
	unsigned offset, pte_size, misaligned;
	offset = offset_in_page(gpa);
2021-11-24 20:20:51 +08:00
pte_size = sp - > role . has_4_byte_gpte ? 4 : 8 ;
2011-09-22 16:57:55 +08:00
/*
 * Sometimes, the OS writes only a single byte to update status
 * bits, for example, in Linux, the andb instruction is used in clear_bit().
*/
	if (!(offset & (pte_size - 1)) && bytes == 1)
		return false;
2011-09-22 16:57:23 +08:00
	misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
	misaligned |= bytes < 4;
	return misaligned;
}
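The XOR test in detect_write_misaligned() is easier to follow with concrete numbers. Below is a minimal userspace sketch of the same arithmetic; `sketch_write_misaligned` is a hypothetical name, and the page offset and gpte size are passed in directly instead of being derived from a shadow page.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical mirror of the check above: offset is the write's offset
 * within the page, pte_size is 4 or 8, bytes is the write length.
 */
static inline bool sketch_write_misaligned(unsigned offset, unsigned pte_size,
					   int bytes)
{
	unsigned misaligned;

	assert(pte_size == 4 || pte_size == 8);

	/* A 1-byte write at the start of an entry is taken as a status-bit update. */
	if (!(offset & (pte_size - 1)) && bytes == 1)
		return false;

	/* Nonzero if the first and last written bytes lie in different entries. */
	misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
	/* Any other write shorter than 4 bytes also counts as misaligned. */
	misaligned |= bytes < 4;
	return misaligned;
}
```

For 8-byte gptes, an 8-byte write at offset 0 stays inside one entry and passes, whereas an 8-byte write at offset 4 spans two entries and is flagged.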
static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
{
	unsigned page_offset, quadrant;
	u64 *spte;
	int level;
	page_offset = offset_in_page(gpa);
	level = sp->role.level;
	*nspte = 1;
2021-11-24 20:20:51 +08:00
if ( sp - > role . has_4_byte_gpte ) {
2011-09-22 16:57:23 +08:00
		page_offset <<= 1;	/* 32->64 */
		/*
		 * A 32-bit pde maps 4MB while the shadow pdes map
		 * only 2MB.  So we need to double the offset again
		 * and zap two pdes instead of one.
		 */
		if (level == PT32_ROOT_LEVEL) {
			page_offset &= ~7; /* kill rounding error */
			page_offset <<= 1;
			*nspte = 2;
		}
		quadrant = page_offset >> PAGE_SHIFT;
		page_offset &= ~PAGE_MASK;
		if (quadrant != sp->role.quadrant)
			return NULL;
	}
	spte = &sp->spt[page_offset / sizeof(*spte)];
	return spte;
}
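The offset doubling and quadrant selection in get_written_sptes() can be checked with a standalone sketch. It covers only the 4-byte-gpte case, assumes 4 KiB pages and a 32-bit root level of 2, and returns an index (or -1) instead of a pointer; all `sketch_`/`SK_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT		12			/* assumption: 4 KiB pages */
#define SK_PAGE_SIZE		(1u << SK_PAGE_SHIFT)
#define SK_PT32_ROOT_LEVEL	2			/* assumption: 32-bit paging root level */

/*
 * Hypothetical mirror of the 4-byte-gpte path: returns the index of the
 * first affected 8-byte spte within the shadow page, or -1 if the write
 * maps to a different quadrant than sp_quadrant.
 */
static inline int sketch_written_spte_index(unsigned page_offset, int level,
					    unsigned sp_quadrant, int *nspte)
{
	unsigned quadrant;

	assert(nspte);
	*nspte = 1;

	page_offset <<= 1;			/* 4-byte gpte -> 8-byte spte */
	if (level == SK_PT32_ROOT_LEVEL) {
		page_offset &= ~7u;		/* align to a whole spte */
		page_offset <<= 1;		/* one 4MB pde -> two 2MB pdes */
		*nspte = 2;
	}
	quadrant = page_offset >> SK_PAGE_SHIFT;
	page_offset &= SK_PAGE_SIZE - 1;	/* offset within the shadow page */
	if (quadrant != sp_quadrant)
		return -1;
	return (int)(page_offset / sizeof(uint64_t));
}
```

As a worked case, a write to guest pde 257 (byte offset 0x404) at the root level lands on shadow pdes 514 and 515, which is index 2 in quadrant 1 with `*nspte == 2`.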
2023-07-28 18:35:20 -07:00
void kvm_mmu_track_write ( struct kvm_vcpu * vcpu , gpa_t gpa , const u8 * new ,
int bytes )
2011-09-22 16:57:23 +08:00
{
gfn_t gfn = gpa > > PAGE_SHIFT ;
struct kvm_mmu_page * sp ;
LIST_HEAD ( invalid_list ) ;
u64 entry , gentry , * spte ;
int npte ;
2021-09-18 08:56:31 +08:00
bool flush = false ;
2011-09-22 16:57:23 +08:00
/*
 * If we don't have indirect shadow pages, it means no page is
 * write-protected, so we can simply exit.
*/
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23 14:07:29 -07:00
if ( ! READ_ONCE ( vcpu - > kvm - > arch . indirect_shadow_pages ) )
2011-09-22 16:57:23 +08:00
return ;
2021-02-02 10:57:24 -08:00
write_lock ( & vcpu - > kvm - > mmu_lock ) ;
2018-10-31 14:53:57 -07:00
gentry = mmu_pte_write_fetch_gpte ( vcpu , & gpa , & bytes ) ;
2011-09-22 16:57:23 +08:00
+ + vcpu - > kvm - > stat . mmu_pte_write ;
2022-04-20 21:12:03 +08:00
for_each_gfn_valid_sp_with_gptes ( vcpu - > kvm , sp , gfn ) {
2011-09-22 16:58:36 +08:00
if ( detect_write_misaligned ( sp , gpa , bytes ) | |
2011-12-16 18:18:10 +08:00
detect_write_flooding ( sp ) ) {
2016-02-24 11:21:55 +01:00
kvm_mmu_prepare_zap_page ( vcpu - > kvm , sp , & invalid_list ) ;
2007-11-18 16:37:07 +02:00
+ + vcpu - > kvm - > stat . mmu_flooded ;
2007-01-05 16:36:48 -08:00
continue ;
}
2011-09-22 16:57:23 +08:00
spte = get_written_sptes ( sp , gpa , & npte ) ;
if ( ! spte )
continue ;
2007-03-08 17:13:32 +02:00
while ( npte - - ) {
2007-11-21 02:06:21 +02:00
entry = * spte ;
2020-09-23 15:14:06 -07:00
mmu_page_zap_pte ( vcpu - > kvm , sp , spte , NULL ) ;
KVM: x86/mmu: Remove the defunct update_pte() paging hook
Remove the update_pte() shadow paging logic, which was obsoleted by
commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"), but never
removed. As pointed out by Yu, KVM never write protects leaf page
tables for the purposes of shadow paging, and instead marks their
associated shadow page as unsync so that the guest can write PTEs at
will.
The update_pte() path, which predates the unsync logic, optimizes COW
scenarios by refreshing leaf SPTEs when they are written, as opposed to
zapping the SPTE, restarting the guest, and installing the new SPTE on
the subsequent fault. Since KVM no longer write-protects leaf page
tables, update_pte() is unreachable and can be dropped.
Reported-by: Yu Zhang <yu.c.zhang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115004051.4099250-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-14 16:40:51 -08:00
if ( gentry & & sp - > role . level ! = PG_LEVEL_4K )
+ + vcpu - > kvm - > stat . mmu_pde_zapped ;
2022-07-22 19:43:16 -07:00
if ( is_shadow_present_pte ( entry ) )
2021-09-18 08:56:31 +08:00
flush = true ;
2007-03-08 17:13:32 +02:00
+ + spte ;
2007-01-05 16:36:45 -08:00
}
}
2021-09-18 08:56:31 +08:00
kvm_mmu_remote_flush_or_zap ( vcpu - > kvm , & invalid_list , flush ) ;
2021-02-02 10:57:24 -08:00
write_unlock ( & vcpu - > kvm - > mmu_lock ) ;
2007-01-05 16:36:44 -08:00
}
2022-04-23 03:47:49 +00:00
int noinline kvm_mmu_page_fault ( struct kvm_vcpu * vcpu , gpa_t cr2_or_gpa , u64 error_code ,
2010-12-21 11:12:07 +01:00
void * insn , int insn_len )
2007-10-28 18:48:59 +02:00
{
2020-02-18 15:03:08 -08:00
int r , emulation_type = EMULTYPE_PF ;
2022-02-10 08:00:56 -05:00
bool direct = vcpu - > arch . mmu - > root_role . direct ;
2007-10-28 18:48:59 +02:00
2023-07-21 15:37:11 -07:00
/*
 * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
 * checks when emulating instructions that trigger implicit accesses.
 * WARN if hardware generates a fault with an error code that collides
 * with the KVM-defined value.  Clear the flag and continue on, i.e.
 * don't terminate the VM, as KVM can't possibly be relying on a flag
 * that KVM doesn't know about.
*/
if ( WARN_ON_ONCE ( error_code & PFERR_IMPLICIT_ACCESS ) )
error_code & = ~ PFERR_IMPLICIT_ACCESS ;
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
if ( WARN_ON_ONCE ( ! VALID_PAGE ( vcpu - > arch . mmu - > root . hpa ) ) )
KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler
Add a check on root_hpa at the beginning of the page fault handler to
consolidate several checks on root_hpa that are scattered throughout the
page fault code. This is a preparatory step towards eventually removing
such checks altogether, or at the very least WARNing if an invalid root
is encountered. Remove only the checks that can be easily audited to
confirm that root_hpa cannot be invalidated between their current
location and the new check in kvm_mmu_page_fault(), and aren't currently
protected by mmu_lock, i.e. keep the checks in __direct_map() and
FNAME(fetch) for the time being.
The root_hpa checks that are consolidated were all added by commit
37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere")
which was a follow up to a bug fix for __direct_map(), commit
989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map")
At the time, nested VMX had, in hindsight, crazy handling of nested
interrupts and would trigger a nested VM-Exit in ->interrupt_allowed(),
and thus unexpectedly reset the MMU in flows such as can_do_async_pf().
Now that the wonky nested VM-Exit behavior is gone, the root_hpa checks
are bogus and confusing, e.g. it's not at all obvious what they actually
protect against, and at first glance they appear to be broken since many
of them run without holding mmu_lock.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-06 15:57:27 -08:00
return RET_PF_RETRY ;
2017-08-17 15:03:32 +02:00
r = RET_PF_INVALID ;
2016-02-22 17:23:41 +09:00
if ( unlikely ( error_code & PFERR_RSVD_MASK ) ) {
2019-12-06 15:57:14 -08:00
r = handle_mmio_page_fault ( vcpu , cr2_or_gpa , direct ) ;
2018-08-23 13:56:50 -07:00
if ( r = = RET_PF_EMULATE )
2016-02-22 17:23:41 +09:00
goto emulate ;
}
2007-10-28 18:48:59 +02:00
2017-08-17 15:03:32 +02:00
if ( r = = RET_PF_INVALID ) {
2020-02-06 14:14:34 -08:00
r = kvm_mmu_do_page_fault ( vcpu , cr2_or_gpa ,
2023-02-02 18:28:15 +00:00
lower_32_bits ( error_code ) , false ,
& emulation_type ) ;
2021-07-02 15:04:26 -07:00
if ( KVM_BUG_ON ( r = = RET_PF_INVALID , vcpu - > kvm ) )
2020-09-23 15:04:22 -07:00
return - EIO ;
2017-08-17 15:03:32 +02:00
}
2007-10-28 18:48:59 +02:00
if ( r < 0 )
2016-02-22 17:23:41 +09:00
return r ;
2020-09-23 15:04:23 -07:00
if ( r ! = RET_PF_EMULATE )
return 1 ;
2007-10-28 18:48:59 +02:00
2016-11-23 12:01:38 -05:00
/*
* Before emulating the instruction , check if the error code
* was due to a RO violation while translating the guest page .
* This can occur when using nested virtualization with nested
* paging in both guests . If true , we simply unprotect the page
* and resume the guest .
*/
2022-02-10 08:00:56 -05:00
if ( vcpu - > arch . mmu - > root_role . direct & &
2016-11-28 14:39:58 +01:00
( error_code & PFERR_NESTED_GUEST_PAGE ) = = PFERR_NESTED_GUEST_PAGE ) {
2019-12-06 15:57:14 -08:00
kvm_mmu_unprotect_page ( vcpu - > kvm , gpa_to_gfn ( cr2_or_gpa ) ) ;
2016-11-23 12:01:38 -05:00
return 1 ;
}
2018-08-23 13:56:50 -07:00
/*
* vcpu - > arch . mmu . page_fault returned RET_PF_EMULATE , but we can still
* optimistically try to just unprotect the page and let the processor
* re - execute the instruction that caused the page fault . Do not allow
* retrying MMIO emulation , as it ' s not only pointless but could also
* cause us to enter an infinite loop because the processor will keep
KVM: x86: Do not re-{try,execute} after failed emulation in L2
Commit a6f177efaa58 ("KVM: Reenter guest after emulation failure if
due to access to non-mmio address") added reexecute_instruction() to
handle the scenario where two (or more) vCPUS race to write a shadowed
page, i.e. reexecute_instruction() is intended to return true if and
only if the instruction being emulated was accessing a shadowed page.
As L0 is only explicitly shadowing L1 tables, an emulation failure of
a nested VM instruction cannot be due to a race to write a shadowed
page and so should never be re-executed.
This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
but guaranteed to result in an infinite loop when TDP is enabled.
Because "cr2" is actually an L2 GPA when TDP is enabled, calling
kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
and cause reexecute_instruction() to immediately return true. The
!mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
case because mmio_info_in_cache() returns false for a nested MMU (the
MMIO caching currently handles L1 only, e.g. to cache nested guests'
GPAs we'd have to manually flush the cache when switching between
VMs and when L1 updated its page tables controlling the nested guest).
Way back when, commit 68be0803456b ("KVM: x86: never re-execute
instruction with enabled tdp") changed reexecute_instruction() to
always return false when using TDP under the assumption that KVM would
only get into the emulator for MMIO. Commit 95b3cf69bdf8 ("KVM: x86:
let reexecute_instruction work for tdp") effectively reverted that
behavior in order to handle the scenario where emulation failed due to
an access from L1 to the shadow page tables for L2, but it didn't
account for the case where emulation failed in L2 with TDP enabled.
All of the above logic also applies to retry_instruction(), added by
commit 1cb3f3ae5a38 ("KVM: x86: retry non-page-table writing
instructions"). An indefinite loop in retry_instruction() should be
impossible as it protects against retrying the same instruction over
and over, but it's still correct to not retry an L2 instruction in
the first place.
Fix the immediate issue by adding a check for a nested guest when
determining whether or not to allow retry in kvm_mmu_page_fault().
In addition to fixing the immediate bug, add WARN_ON_ONCE in the
retry functions since they are not designed to handle nested cases,
i.e. they need to be modified even if there is some scenario in the
future where we want to allow retrying a nested guest.
[1] This issue was encountered after commit 3a2936dedd20 ("kvm: mmu:
Don't expose private memslots to L2") changed the page fault path
to return KVM_PFN_NOSLOT when translating an L2 access to a
private memslot. Returning KVM_PFN_NOSLOT is semantically correct
when we want to hide a memslot from L2, i.e. there effectively is
no defined memory region for L2, but it has the unfortunate side
effect of making KVM think the GFN is a MMIO page, thus triggering
emulation. The failure occurred with in-development code that
deliberately exposed a private memslot to L2, which L2 accessed
with an instruction that is not emulated by KVM.
Fixes: 95b3cf69bdf8 ("KVM: x86: let reexecute_instruction work for tdp")
Fixes: 1cb3f3ae5a38 ("KVM: x86: retry non-page-table writing instructions")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-23 13:56:51 -07:00
 * faulting on the non-existent MMIO address.  Retrying an instruction
 * from a nested guest is also pointless and dangerous as we are only
 * explicitly shadowing L1's page tables, i.e. unprotecting something
 * for L1 isn't going to magically fix whatever issue caused L2 to fail.
2018-08-23 13:56:50 -07:00
*/
2019-12-06 15:57:14 -08:00
if ( ! mmio_info_in_cache ( vcpu , cr2_or_gpa , direct ) & & ! is_guest_mode ( vcpu ) )
2020-02-18 15:03:08 -08:00
emulation_type | = EMULTYPE_ALLOW_RETRY_PF ;
2016-02-22 17:23:41 +09:00
emulate :
2019-12-06 15:57:14 -08:00
return x86_emulate_instruction ( vcpu , cr2_or_gpa , emulation_type , insn ,
KVM: x86: Remove emulation_result enums, EMULATE_{DONE,FAIL,USER_EXIT}
Deferring emulation failure handling (in some cases) to the caller of
x86_emulate_instruction() has proven fragile, e.g. multiple instances of
KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
being difficult to discern what emulation types can return what result,
and which combination of types and results are handled where.
Now that x86_emulate_instruction() always handles emulation failure,
i.e. EMULATION_FAIL is only referenced in callers, remove the
emulation_result enums entirely. Per KVM's existing exit handling
conventions, return '0' and '1' for "exit to userspace" and "resume
guest" respectively. Doing so cleans up many callers, e.g. they can
return kvm_emulate_instruction() directly instead of having to interpret
its result.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-27 14:40:38 -07:00
insn_len ) ;
2007-10-28 18:48:59 +02:00
}
EXPORT_SYMBOL_GPL ( kvm_mmu_page_fault ) ;
KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead.
In hardware TLB, invalidating TLB entries means the translations are
removed from the TLB.
In KVM shadowed vTLB, the translations (combinations of shadow paging
and hardware TLB) are generally maintained as long as they remain "clean"
when the TLB of an address space (i.e. a PCID or all) is flushed with
the help of write-protections, sp->unsync, and kvm_sync_page(), where
"clean" in this context means that no updates to KVM's SPTEs are needed.
However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is
unsync, and thus triggers a remote flush even if the original vTLB entry
is clean, i.e. is usable as-is.
Besides this, FNAME(invlpg) is largely a duplicate implementation of
FNAME(sync_spte) to invalidate a vTLB entry.
To address both issues, reuse FNAME(sync_spte) to share the code and
slightly modify the semantics, i.e. keep the vTLB entry if it's "clean"
and avoid remote TLB flush.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-17 07:53:19 +08:00
static void __kvm_mmu_invalidate_addr ( struct kvm_vcpu * vcpu , struct kvm_mmu * mmu ,
u64 addr , hpa_t root_hpa )
{
struct kvm_shadow_walk_iterator iterator ;
vcpu_clear_mmio_info ( vcpu , addr ) ;
2023-05-23 11:29:47 +08:00
/*
* Walking and synchronizing SPTEs both assume they are operating in
* the context of the current MMU , and would need to be reworked if
* this is ever used to sync the guest_mmu , e . g . to emulate INVEPT .
*/
if ( WARN_ON_ONCE ( mmu ! = vcpu - > arch . mmu ) )
return ;
2023-02-17 07:53:19 +08:00
if ( ! VALID_PAGE ( root_hpa ) )
return ;
write_lock ( & vcpu - > kvm - > mmu_lock ) ;
for_each_shadow_entry_using_root ( vcpu , root_hpa , addr , iterator ) {
struct kvm_mmu_page * sp = sptep_to_sp ( iterator . sptep ) ;
if ( sp - > unsync ) {
2023-02-17 07:53:21 +08:00
int ret = kvm_sync_spte ( vcpu , sp , iterator . index ) ;
2023-02-17 07:53:19 +08:00
if ( ret < 0 )
mmu_page_zap_pte ( vcpu - > kvm , sp , iterator . sptep , NULL ) ;
if ( ret )
kvm_flush_remote_tlbs_sptep ( vcpu - > kvm , iterator . sptep ) ;
}
if ( ! sp - > unsync_children )
break ;
}
write_unlock ( & vcpu - > kvm - > mmu_lock ) ;
}
2023-02-16 23:41:07 +08:00
void kvm_mmu_invalidate_addr ( struct kvm_vcpu * vcpu , struct kvm_mmu * mmu ,
2023-02-16 23:41:14 +08:00
u64 addr , unsigned long roots )
2008-09-23 13:18:35 -03:00
{
2018-06-27 14:59:20 -07:00
int i ;
2018-06-27 14:59:16 -07:00
2023-02-16 23:41:14 +08:00
WARN_ON_ONCE ( roots & ~ KVM_MMU_ROOTS_ALL ) ;
2020-03-23 20:42:57 -04:00
/* It's actually a GPA for vcpu->arch.guest_mmu. */
if ( mmu ! = & vcpu - > arch . guest_mmu ) {
/* INVLPG on a non-canonical address is a NOP according to the SDM. */
2023-02-16 23:41:07 +08:00
if ( is_noncanonical_address ( addr , vcpu ) )
2020-03-23 20:42:57 -04:00
return ;
2023-02-16 23:41:07 +08:00
static_call ( kvm_x86_flush_tlb_gva ) ( vcpu , addr ) ;
2020-03-23 20:42:57 -04:00
}
2023-02-17 07:53:19 +08:00
if ( ! mmu - > sync_spte )
2018-06-29 13:10:05 -07:00
return ;
2023-02-16 23:41:14 +08:00
if ( roots & KVM_MMU_ROOT_CURRENT )
2023-02-17 07:53:19 +08:00
__kvm_mmu_invalidate_addr ( vcpu , mmu , addr , mmu - > root . hpa ) ;
2018-06-27 14:59:18 -07:00
2023-02-16 23:41:14 +08:00
for ( i = 0 ; i < KVM_MMU_NUM_PREV_ROOTS ; i + + ) {
2023-02-17 07:53:18 +08:00
if ( roots & KVM_MMU_ROOT_PREVIOUS ( i ) )
2023-02-17 07:53:19 +08:00
__kvm_mmu_invalidate_addr ( vcpu , mmu , addr , mmu - > prev_roots [ i ] . hpa ) ;
2020-03-23 20:42:57 -04:00
}
}
2023-02-17 07:53:17 +08:00
EXPORT_SYMBOL_GPL ( kvm_mmu_invalidate_addr ) ;
2018-06-27 14:59:18 -07:00
2020-03-23 20:42:57 -04:00
void kvm_mmu_invlpg ( struct kvm_vcpu * vcpu , gva_t gva )
{
2023-02-16 23:41:14 +08:00
/*
 * INVLPG is required to invalidate any global mappings for the VA,
 * irrespective of PCID.  Blindly sync all roots as it would take
 * roughly the same amount of work/time to determine whether any of the
 * previous roots have a global mapping.
 *
 * Mappings not reachable via the current or previous cached roots will
 * be synced when switching to that new cr3, so nothing needs to be
 * done here for them.
*/
kvm_mmu_invalidate_addr ( vcpu , vcpu - > arch . walk_mmu , gva , KVM_MMU_ROOTS_ALL ) ;
2008-09-23 13:18:35 -03:00
+ + vcpu - > stat . invlpg ;
}
EXPORT_SYMBOL_GPL ( kvm_mmu_invlpg ) ;
2020-03-23 20:42:57 -04:00
2018-06-27 14:59:14 -07:00
void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
{
2018-10-08 21:28:05 +02:00
struct kvm_mmu *mmu = vcpu->arch.mmu;
2023-02-16 23:41:15 +08:00
unsigned long roots = 0;
2018-06-27 14:59:20 -07:00
uint i;
2018-06-27 14:59:14 -07:00
2023-02-16 23:41:15 +08:00
if (pcid == kvm_get_active_pcid(vcpu))
roots |= KVM_MMU_ROOT_CURRENT;
2018-06-27 14:59:14 -07:00
2018-06-27 14:59:20 -07:00
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
2023-02-16 23:41:15 +08:00
pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd))
roots |= KVM_MMU_ROOT_PREVIOUS(i);
2018-06-27 14:59:18 -07:00
}
2018-06-27 14:59:15 -07:00
2023-02-16 23:41:15 +08:00
if (roots)
kvm_mmu_invalidate_addr(vcpu, mmu, gva, roots);
2018-06-27 14:59:14 -07:00
++vcpu->stat.invlpg;
/*
2018-06-27 14:59:20 -07:00
* Mappings not reachable via the current cr3 or the prev_roots will be
* synced when switching to that cr3, so nothing needs to be done here
* for them.
2018-06-27 14:59:14 -07:00
*/
}
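The root selection in kvm_mmu_invpcid_gva() above can be illustrated outside the kernel. The following is a minimal sketch, not the kernel's code: struct root_info, pcid_of(), roots_for_pcid() and the ROOT_* macros are hypothetical stand-ins, and it assumes PCIDs are enabled so that the PCID is the low 12 bits of the cached CR3 value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define NUM_PREV_ROOTS   3
#define ROOT_CURRENT     (1ul << 0)
#define ROOT_PREVIOUS(i) (1ul << ((i) + 1))
#define PCID_MASK        0xffful	/* assumes PCID = low 12 bits of CR3 */

struct root_info {
	bool valid;	/* stands in for VALID_PAGE(hpa) */
	uint64_t pgd;	/* guest CR3 value cached for this root */
};

static unsigned long pcid_of(uint64_t pgd)
{
	return (unsigned long)(pgd & PCID_MASK);
}

/*
 * Build a bitmask of the roots whose PCID matches @pcid: bit 0 for the
 * active root, bit i + 1 for valid cached previous root i.
 */
static unsigned long roots_for_pcid(uint64_t active_pgd,
				    const struct root_info *prev, int n_prev,
				    unsigned long pcid)
{
	unsigned long roots = 0;
	int i;

	assert(n_prev <= NUM_PREV_ROOTS);

	if (pcid == pcid_of(active_pgd))
		roots |= ROOT_CURRENT;

	for (i = 0; i < n_prev; i++) {
		if (prev[i].valid && pcid == pcid_of(prev[i].pgd))
			roots |= ROOT_PREVIOUS(i);
	}

	return roots;
}
```

A caller would invalidate the address only when the returned mask is non-zero, which mirrors the "if (roots)" check in the function above.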
2021-08-18 11:55:47 -05:00
void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
int tdp_max_root_level, int tdp_huge_page_level)
2008-02-07 13:47:41 +01:00
{
2020-03-02 15:57:02 -08:00
tdp_enabled = enable_tdp;
2021-08-18 11:55:47 -05:00
tdp_root_level = tdp_forced_root_level;
2020-07-15 20:41:22 -07:00
max_tdp_level = tdp_max_root_level;
2020-03-02 15:57:03 -08:00
2022-09-21 10:35:37 -07:00
#ifdef CONFIG_X86_64
tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
#endif
2020-03-02 15:57:03 -08:00
/*
2020-07-15 20:41:21 -07:00
* max_huge_page_level reflects KVM's MMU capabilities irrespective
2020-03-02 15:57:03 -08:00
* of kernel support, e.g. KVM may be capable of using 1GB pages when
* the kernel is not. But, KVM never creates a page size greater than
* what is used by the kernel for any given HVA, i.e. the kernel's
* capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
*/
if (tdp_enabled)
2020-07-15 20:41:21 -07:00
max_huge_page_level = tdp_huge_page_level;
2020-03-02 15:57:03 -08:00
else if (boot_cpu_has(X86_FEATURE_GBPAGES))
2020-07-15 20:41:21 -07:00
max_huge_page_level = PG_LEVEL_1G;
2020-03-02 15:57:03 -08:00
else
2020-07-15 20:41:21 -07:00
max_huge_page_level = PG_LEVEL_2M;
2008-02-07 13:47:41 +01:00
}
2020-03-02 15:57:02 -08:00
EXPORT_SYMBOL_GPL(kvm_configure_mmu);
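The huge page level choice in kvm_configure_mmu() above, together with the comment's point that the host kernel's mapping size is the final limit, can be sketched as two small pure functions. This is an illustrative model only: enum pg_level and both helper names are hypothetical, and the clamp is an assumed simplification of what the comment attributes to kvm_mmu_hugepage_adjust().

```c
#include <stdbool.h>

/* Hypothetical page-size levels, ordered from smallest to largest. */
enum pg_level { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

/*
 * With TDP the vendor module supplies the limit; with shadow paging the
 * limit depends on whether the CPU supports 1GB pages.
 */
static enum pg_level pick_max_huge_page_level(bool tdp_enabled,
					      enum pg_level tdp_huge_page_level,
					      bool cpu_has_gbpages)
{
	if (tdp_enabled)
		return tdp_huge_page_level;

	return cpu_has_gbpages ? LEVEL_1G : LEVEL_2M;
}

/*
 * Assumed simplification: a mapping never exceeds the level the host
 * uses for the same address, whatever the MMU itself could handle.
 */
static enum pg_level clamp_to_host_level(enum pg_level mmu_max,
					 enum pg_level host_level)
{
	return host_level < mmu_max ? host_level : mmu_max;
}
```

Keeping the two decisions separate reflects the comment above: the first value describes MMU capability only, while the host's page size is consulted later, per mapping.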
2019-02-05 13:01:19 -08:00
/* The return value indicates if tlb flush on all vcpus is needed. */
2023-02-02 18:27:49 +00:00
typedef bool (*slot_rmaps_handler)(struct kvm *kvm,
2021-07-12 22:33:38 -04:00
struct kvm_rmap_head *rmap_head,
const struct kvm_memory_slot *slot);
2019-02-05 13:01:19 -08:00
2023-02-02 18:27:49 +00:00
static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
const struct kvm_memory_slot *slot,
slot_rmaps_handler fn,
int start_level, int end_level,
gfn_t start_gfn, gfn_t end_gfn,
bool flush_on_yield, bool flush)
2019-02-05 13:01:19 -08:00
{
struct slot_rmap_walk_iterator iterator;
2023-02-02 18:27:50 +00:00
lockdep_assert_held_write(&kvm->mmu_lock);
2023-02-02 18:27:49 +00:00
for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
2019-02-05 13:01:19 -08:00
end_gfn, &iterator) {
if (iterator.rmap)
2023-02-02 18:27:49 +00:00
flush |= fn(kvm, iterator.rmap, slot);
2019-02-05 13:01:19 -08:00
2021-02-02 10:57:24 -08:00
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
2021-03-25 19:19:41 -07:00
if (flush && flush_on_yield) {
2023-01-26 10:40:22 -08:00
kvm_flush_remote_tlbs_range(kvm, start_gfn,
iterator.gfn - start_gfn + 1);
2019-02-05 13:01:19 -08:00
flush = false;
}
2021-02-02 10:57:24 -08:00
cond_resched_rwlock_write(&kvm->mmu_lock);
2019-02-05 13:01:19 -08:00
}
}
return flush;
}
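The "accumulate a pending flush, flush before yielding" pattern used by __walk_slot_rmaps() above can be modeled in plain C. The sketch below is a simplified illustration under stated assumptions: walk_entries(), struct walk_stats and clear_if_set() are hypothetical, a fixed yield_every interval stands in for need_resched() || rwlock_needbreak(), and a counter stands in for the remote TLB flush.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping so the behaviour can be observed. */
struct walk_stats {
	int flushes;	/* stand-in for remote TLB flushes issued */
	int yields;	/* stand-in for lock drops / reschedules */
};

/* Handler returns true if its change requires a flush. */
typedef bool (*entry_handler)(int *entry);

static bool walk_entries(int *entries, size_t n, entry_handler fn,
			 size_t yield_every, bool flush_on_yield,
			 struct walk_stats *stats)
{
	bool flush = false;
	size_t i;

	for (i = 0; i < n; i++) {
		flush |= fn(&entries[i]);

		/* Assumed yield condition: every yield_every entries. */
		if (yield_every && (i + 1) % yield_every == 0) {
			/* Flush pending work before giving up the lock. */
			if (flush && flush_on_yield) {
				stats->flushes++;
				flush = false;
			}
			stats->yields++;
		}
	}

	/* The caller is responsible for any flush still pending. */
	return flush;
}

/* Example handler: clear a non-zero entry and request a flush. */
static bool clear_if_set(int *entry)
{
	if (!*entry)
		return false;

	*entry = 0;
	return true;
}
```

The return value carries the still-pending flush back to the caller, so a caller that batches several walks can issue one flush at the end instead of one per walk.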
2023-02-02 18:27:49 +00:00
static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
const struct kvm_memory_slot *slot,
slot_rmaps_handler fn,
int start_level, int end_level,
bool flush_on_yield)
2019-02-05 13:01:19 -08:00
{
2023-02-02 18:27:49 +00:00
return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
slot->base_gfn, slot->base_gfn + slot->npages - 1,
flush_on_yield, false);
2019-02-05 13:01:19 -08:00
}
2023-02-02 18:27:49 +00:00
static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
const struct kvm_memory_slot *slot,
slot_rmaps_handler fn,
bool flush_on_yield)
2019-02-05 13:01:19 -08:00
{
2023-02-02 18:27:49 +00:00
return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
2019-02-05 13:01:19 -08:00
}
2019-06-22 19:42:04 +02:00
static void free_mmu_pages(struct kvm_mmu *mmu)
2006-12-10 02:21:36 -08:00
{
2021-03-09 14:42:07 -08:00
if (!tdp_enabled && mmu->pae_root)
set_memory_encrypted((unsigned long)mmu->pae_root, 1);
2019-06-22 19:42:04 +02:00
free_page((unsigned long)mmu->pae_root);
2021-05-05 13:42:21 -07:00
free_page((unsigned long)mmu->pml4_root);
2021-08-18 11:55:48 -05:00
free_page((unsigned long)mmu->pml5_root);
2006-12-10 02:21:36 -08:00
}
2020-09-23 09:33:14 -07:00
static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
2006-12-10 02:21:36 -08:00
{
2007-01-05 16:36:40 -08:00
struct page *page;
2006-12-10 02:21:36 -08:00
int i;
2022-02-21 09:28:33 -05:00
mmu->root.hpa = INVALID_PAGE;
mmu->root.pgd = 0;
2020-09-23 09:33:14 -07:00
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
2021-11-18 19:08:09 +08:00
/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
return 0;
2007-01-05 16:36:40 -08:00
/*
2019-06-13 10:22:23 -07:00
* When using PAE paging, the four PDPTEs are treated as 'root' pages,
* while the PDP table is a per-vCPU construct that's allocated at MMU
* creation. When emulating 32-bit mode, cr3 is only 32 bits even on
* x86_64. Therefore we need to allocate the PDP table in the first
2021-03-04 17:10:46 -08:00
* 4GB of memory, which happens to fit the DMA32 zone. TDP paging
* generally doesn't use PAE paging and can skip allocating the PDP
* table. The main exception, handled here, is SVM's 32-bit NPT. The
* other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
2021-11-18 19:08:10 +08:00
* KVM; that horror is handled on-demand by mmu_alloc_special_roots().
2007-01-05 16:36:40 -08:00
*/
2020-07-15 20:41:20 -07:00
if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
2019-06-13 10:22:23 -07:00
return 0;
2019-02-11 11:02:50 -08:00
page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
2007-01-05 16:36:40 -08:00
if (!page)
2010-01-22 16:55:05 +08:00
return -ENOMEM;
2019-06-22 19:42:04 +02:00
mmu->pae_root = page_address(page);
2021-03-09 14:42:07 -08:00
/*
* CR3 is only 32 bits when PAE paging is used, thus it's impossible to
* get the CPU to treat the PDPTEs as encrypted. Decrypt the page so
* that KVM's writes and the CPU's reads get along. Note, this is
* only necessary when using shadow paging, as 64-bit NPT can get at
* the C-bit even when shadowing 32-bit NPT, and SME isn't supported
* by 32-bit kernels (when KVM itself uses 32-bit NPT).
*/
if (!tdp_enabled)
set_memory_decrypted((unsigned long)mmu->pae_root, 1);
else
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
high bits of the physical address as 'KeyID' bits. Intel Trust Domain
Extensions (TDX) further steals part of the MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike AMD's SME, the host kernel doesn't set
any legacy MKTME KeyID bits in any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in an SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for an SPTE which maps guest memory. MKTME KeyID bits should be set
in shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask in the SPTE, and shadow_me_mask is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME KeyID bits doesn't work for VMX (on the contrary, they must be set
in shadow_zero_check).
Introduce a new 'shadow_me_value' to replace the existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new semantics are:
- shadow_me_value: the memory encryption bit(s) that will be set in the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a
superset of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is constant during KVM's lifetime. This
perhaps doesn't fit MKTME, but for now the host kernel doesn't support it
(and perhaps never will).
- Bits in shadow_me_mask are set in shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether a remote TLB flush is needed. This should still use
shadow_me_mask, as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-19 23:17:03 +12:00
WARN_ON_ONCE(shadow_me_value);
2021-03-09 14:42:07 -08:00
2007-01-05 16:36:40 -08:00
for (i = 0; i < 4; ++i)
2021-03-09 14:42:06 -08:00
mmu->pae_root[i] = INVALID_PAE_ROOT;
2007-01-05 16:36:40 -08:00
2006-12-10 02:21:36 -08:00
return 0;
}
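The relationship between shadow_me_value and shadow_me_mask described in the commit message quoted inside __kvm_mmu_create() above can be written down as two bit operations. This is a minimal sketch, not the kernel's helper: struct me_masks and both function names are hypothetical, and the bit positions a caller passes in are purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two masks named in the commit message. */
struct me_masks {
	uint64_t me_value;	/* encryption bit(s) set in every SPTE */
	uint64_t me_mask;	/* all possible memory encryption bits */
};

/* me_value must be a subset of me_mask for the pair to make sense. */
static bool me_masks_valid(const struct me_masks *m)
{
	return (m->me_value & ~m->me_mask) == 0;
}

/*
 * Encryption bits that must be zero in an SPTE: every bit in the mask
 * except those that are deliberately set via me_value.
 */
static uint64_t me_bits_must_be_zero(const struct me_masks *m)
{
	return m->me_mask & ~m->me_value;
}
```

With value equal to mask (the SME-like case from the commit message) nothing is reserved; with value zero and a non-zero mask (the MKTME-like case) every KeyID bit ends up in the must-be-zero set.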
2006-12-29 16:50:01 -08:00
int kvm_mmu_create(struct kvm_vcpu *vcpu)
2006-12-10 02:21:36 -08:00
{
2019-06-22 19:42:04 +02:00
int ret;
2018-06-27 14:59:20 -07:00
2020-07-02 19:35:25 -07:00
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
2020-07-02 19:35:34 -07:00
vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
2020-07-02 19:35:25 -07:00
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
2020-07-02 19:35:34 -07:00
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
2020-07-02 19:35:25 -07:00
2020-07-02 19:35:35 -07:00
vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
2018-10-08 21:28:05 +02:00
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
2006-12-10 02:21:36 -08:00
2020-09-23 09:33:14 -07:00
	ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
2019-06-22 19:42:04 +02:00
	if (ret)
		return ret;
2020-09-23 09:33:14 -07:00
	ret = __kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
2019-06-22 19:42:04 +02:00
	if (ret)
		goto fail_allocate_root;
	return ret;
fail_allocate_root:
	free_mmu_pages(&vcpu->arch.guest_mmu);
	return ret;
2006-12-10 02:21:36 -08:00
}
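The create function above allocates two MMU contexts in order and, if the second allocation fails, jumps to a label that releases the first. A minimal userspace sketch of that goto-based unwind follows. Every `demo_*` name is hypothetical and stands in for the kernel helpers; the `should_fail` and `fail_second` parameters exist only to force the error path.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one MMU context; not a KVM type. */
struct demo_mmu {
	void *pages;
};

/* Allocate backing storage for one context; returns 0 on success, -1 on failure. */
static int demo_mmu_create(struct demo_mmu *mmu, int should_fail)
{
	if (should_fail)
		return -1;
	mmu->pages = calloc(1, 4096);
	return mmu->pages ? 0 : -1;
}

static void demo_mmu_free(struct demo_mmu *mmu)
{
	free(mmu->pages);
	mmu->pages = NULL;
}

/*
 * Create two contexts in order. If the second one fails, release the first
 * so the caller never sees a half-initialised pair.
 */
static int demo_create_pair(struct demo_mmu *guest, struct demo_mmu *root,
			    int fail_second)
{
	int ret;

	ret = demo_mmu_create(guest, 0);
	if (ret)
		return ret;

	ret = demo_mmu_create(root, fail_second);
	if (ret)
		goto fail_allocate_root;

	return ret;

fail_allocate_root:
	demo_mmu_free(guest);
	return ret;
}
```

Calling `demo_create_pair(&g, &r, 1)` returns -1 and leaves both `pages` pointers NULL, which is the property the label exists to guarantee.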
2019-09-12 19:46:07 -07:00
#define BATCH_ZAP_PAGES 10
2019-09-12 19:46:02 -07:00
static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
2019-09-12 19:46:07 -07:00
	int nr_zapped, batch = 0;
2022-05-11 14:51:22 +00:00
	bool unstable;
2019-09-12 19:46:02 -07:00
restart:
	list_for_each_entry_safe_reverse(sp, node,
	      &kvm->arch.active_mmu_pages, link) {
		/*
		 * No obsolete valid page exists before a newly created page
		 * since active_mmu_pages is a FIFO list.
		 */
		if (!is_obsolete_sp(kvm, sp))
			break;
		/*
2020-06-23 12:35:39 -07:00
		 * Invalid pages should never land back on the list of active
		 * pages.  Skip the bogus page, otherwise we'll get stuck in an
		 * infinite loop if the page gets put back on the list (again).
2019-09-12 19:46:02 -07:00
		 */
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
		if (WARN_ON_ONCE(sp->role.invalid))
2019-09-12 19:46:02 -07:00
			continue;
2019-09-12 19:46:08 -07:00
		/*
		 * No need to flush the TLB since we're only zapping shadow
		 * pages with an obsolete generation number and all vCPUs have
		 * loaded a new root, i.e. the shadow pages being zapped cannot
		 * be in active use by the guest.
		 */
2019-09-12 19:46:07 -07:00
		if (batch >= BATCH_ZAP_PAGES &&
2021-02-02 10:57:24 -08:00
		    cond_resched_rwlock_write(&kvm->mmu_lock)) {
2019-09-12 19:46:07 -07:00
			batch = 0;
2019-09-12 19:46:02 -07:00
			goto restart;
		}
2022-05-11 14:51:22 +00:00
		unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
				&kvm->arch.zapped_obsolete_pages, &nr_zapped);
		batch += nr_zapped;
		if (unstable)
2019-09-12 19:46:02 -07:00
			goto restart;
	}
2019-09-12 19:46:08 -07:00
	/*
2022-02-26 00:15:23 +00:00
	 * Kick all vCPUs (via remote TLB flush) before freeing the page tables
	 * to ensure KVM is not in the middle of a lockless shadow page table
	 * walk, which may reference the pages.  The remote TLB flush itself is
	 * not required and is simply a convenient way to kick vCPUs as needed.
	 * KVM performs a local TLB flush when allocating a new root (see
	 * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
	 * running with an obsolete MMU.
2019-09-12 19:46:08 -07:00
	 */
2019-09-12 19:46:10 -07:00
	kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
2019-09-12 19:46:02 -07:00
}
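The loop in kvm_zap_obsolete_pages() walks pages oldest first, stops at the first page that is still current, and after every BATCH_ZAP_PAGES zaps offers to drop the lock and restart the walk. The following is a simplified userspace sketch of that batching and restart logic, not the kernel implementation. It uses an array ordered oldest first instead of the kernel list, marks pages as zapped instead of unlinking them, and `demo_yield()` is a hypothetical stand-in for cond_resched_rwlock_write() that always reports that the lock was dropped.

```c
#include <stddef.h>

#define DEMO_BATCH 10

struct demo_page {
	unsigned int gen;	/* generation the page was created in */
	int zapped;		/* set once the page has been processed */
};

struct demo_vm {
	struct demo_page *pages;	/* ordered oldest first */
	size_t nr_pages;
	unsigned int valid_gen;		/* the only generation still valid */
	unsigned int nr_yields;		/* how often the "lock" was dropped */
};

/* Stand-in for a lock break: record it and report that a yield happened. */
static int demo_yield(struct demo_vm *vm)
{
	vm->nr_yields++;
	return 1;
}

/* Zap every obsolete page; returns the number of pages zapped. */
static size_t demo_zap_obsolete(struct demo_vm *vm)
{
	size_t i, total = 0;
	int batch = 0;

restart:
	for (i = 0; i < vm->nr_pages; i++) {
		struct demo_page *p = &vm->pages[i];

		if (p->zapped)
			continue;
		/* Pages are in creation order, so nothing newer is obsolete. */
		if (p->gen == vm->valid_gen)
			break;
		/* After a full batch, yield and restart the walk from the top. */
		if (batch >= DEMO_BATCH && demo_yield(vm)) {
			batch = 0;
			goto restart;
		}
		p->zapped = 1;
		batch++;
		total++;
	}
	return total;
}
```

The restart is needed because anything may have changed while the lock was released; the sketch makes a restart cheap by skipping pages already marked as zapped, whereas the kernel removes them from the list.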
/*
 * Fast invalidate all shadow pages and use lock-break technique
 * to zap obsolete pages.
 *
 * It's required when a memslot is being deleted or a VM is being
 * destroyed; in these cases, we should ensure that the KVM MMU does
 * not use any resource of the being-deleted slot or all slots
 * after calling the function.
 */
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
2019-09-12 19:46:11 -07:00
	lockdep_assert_held(&kvm->slots_lock);
2021-02-02 10:57:24 -08:00
	write_lock(&kvm->mmu_lock);
2019-09-12 19:46:06 -07:00
	trace_kvm_mmu_zap_all_fast(kvm);
2019-09-12 19:46:11 -07:00
	/*
	 * Toggle mmu_valid_gen between '0' and '1'.  Because slots_lock is
	 * held for the entire duration of zapping obsolete pages, it's
	 * impossible for there to be multiple invalid generations associated
	 * with *valid* shadow pages at any given time, i.e. there is exactly
	 * one valid generation and (at most) one invalid generation.
	 */
	kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
2019-09-12 19:46:02 -07:00
2022-02-25 18:22:44 +00:00
	/*
	 * In order to ensure all vCPUs drop their soon-to-be invalid roots,
	 * invalidating TDP MMU roots must be done while holding mmu_lock for
	 * write and in the same critical section as making the reload request,
	 * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
2021-04-01 16:37:35 -07:00
	 */
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled)
2021-04-01 16:37:35 -07:00
		kvm_tdp_mmu_invalidate_all_roots(kvm);
2019-09-12 19:46:08 -07:00
	/*
	 * Notify all vcpus to reload their shadow page tables and flush the
	 * TLB.  Then all vcpus will switch to new shadow page tables with the
	 * new mmu_valid_gen.
	 *
	 * Note: we need to do this under the protection of mmu_lock,
	 * otherwise, vcpu would purge shadow page but miss tlb flush.
	 */
2022-02-25 18:22:45 +00:00
	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
2019-09-12 19:46:08 -07:00
2019-09-12 19:46:02 -07:00
	kvm_zap_obsolete_pages(kvm);
2020-10-14 11:26:47 -07:00
2021-02-02 10:57:24 -08:00
	write_unlock(&kvm->mmu_lock);
2021-04-01 16:37:36 -07:00
2022-02-26 00:15:21 +00:00
	/*
	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
	 * returning to the caller, e.g. if the zap is in response to a memslot
	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
	 * associated with the deleted memslot once the update completes, and
	 * deferring the zap until the final reference to the root is put would
	 * lead to use-after-free.
	 */
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled)
2021-04-01 16:37:36 -07:00
		kvm_tdp_mmu_zap_invalidated_roots(kvm);
2019-09-12 19:46:02 -07:00
}
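The fast-invalidate path above does not touch individual pages at first: it flips a one-bit generation counter, which makes every page stamped with the old value obsolete at once. Below is a minimal sketch of that idea, with hypothetical `demo_*` names. It deliberately omits details of the real obsolescence check, which may also consider other page state.

```c
#include <stdbool.h>

struct demo_vm_gen {
	unsigned int valid_gen;	/* toggles between 0 and 1 */
};

struct demo_sp {
	unsigned int gen;	/* generation stamped at creation time */
};

/* New pages are stamped with the generation that is currently valid. */
static void demo_sp_init(const struct demo_vm_gen *vm, struct demo_sp *sp)
{
	sp->gen = vm->valid_gen;
}

/* A page is obsolete once its stamp no longer matches the valid one. */
static bool demo_is_obsolete(const struct demo_vm_gen *vm,
			     const struct demo_sp *sp)
{
	return sp->gen != vm->valid_gen;
}

/* Invalidate every existing page in O(1) by flipping the generation. */
static void demo_invalidate_all(struct demo_vm_gen *vm)
{
	vm->valid_gen = vm->valid_gen ? 0 : 1;
}
```

A single bit is enough only if all obsolete pages are gone before the next flip; otherwise a stale page would match the valid generation again. That is the reason the source comment gives for holding slots_lock across the whole zap.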
2019-09-12 19:46:10 -07:00
static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
{
	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}
KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidated TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated. Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended duration if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).
While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot deletion. Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost. Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.
When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner. In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.
Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong. Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.
Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events). Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.
Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.
Reported-by: Pattara Teerapong <pteerapong@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Yiwei Zhang <zzyiwei@google.com>
Cc: Paul Hsia <paulhsia@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230916003916.2545000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-15 17:39:15 -07:00
void kvm_mmu_init_vm(struct kvm *kvm)
2015-05-13 14:42:23 +08:00
{
2022-03-25 12:42:52 -04:00
	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
2022-10-19 16:56:12 +00:00
	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-12 11:18:15 -07:00
	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
2023-09-15 17:39:15 -07:00
	if (tdp_mmu_enabled)
		kvm_mmu_init_tdp_mmu(kvm);
2020-10-14 11:26:43 -07:00
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.
Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:
(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.
(2) Splitting a huge page may end up re-using an existing lower level
shadow page table. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.
(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.
Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.
This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:09 -04:00
	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
2015-05-13 14:42:23 +08:00
}
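The Eager Page Splitting commit message quoted above describes splitting 1GiB pages into 2MiB pages and then 2MiB pages into 4KiB pages. As a rough illustration of the arithmetic only, the sketch below assumes the usual x86-64 layout of 4KiB base pages and 9 address bits per paging level. The `demo_*` names and level numbering are hypothetical and are not KVM definitions.

```c
#include <stdint.h>

/* Assumed x86-64 layout: 4KiB base pages, 9 address bits per level. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_LEVEL_BITS 9

enum demo_level {
	DEMO_LEVEL_4K = 1,
	DEMO_LEVEL_2M = 2,
	DEMO_LEVEL_1G = 3,
};

/* Size in bytes of one page mapped at the given level. */
static uint64_t demo_page_size(int level)
{
	return 1ULL << (DEMO_PAGE_SHIFT + DEMO_LEVEL_BITS * (level - 1));
}

/* Pages of the next lower level produced by splitting one page once. */
static uint64_t demo_children_per_split(int level)
{
	if (level <= DEMO_LEVEL_4K)
		return 0;	/* a 4KiB page cannot be split further */
	return demo_page_size(level) / demo_page_size(level - 1);
}

/* 4KiB pages that result from fully splitting one page at the given level. */
static uint64_t demo_4k_pages(int level)
{
	return demo_page_size(level) / demo_page_size(DEMO_LEVEL_4K);
}
```

Under these assumptions each split step yields 512 children, so one 1GiB page becomes 512 2MiB pages and, fully split, 262144 4KiB pages. Each of those lower-level SPTEs needs an rmap entry, which is why the message calls out the extra pte_list_desc allocations.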
2022-06-22 15:27:09 -04:00
static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
}
2016-02-24 17:51:16 +08:00
void kvm_mmu_uninit_vm(struct kvm *kvm)
2015-05-13 14:42:23 +08:00
{
2022-09-21 10:35:38 -07:00
	if (tdp_mmu_enabled)
		kvm_mmu_uninit_tdp_mmu(kvm);
2022-06-22 15:27:09 -04:00
	mmu_free_vm_memory_caches(kvm);
2015-05-13 14:42:23 +08:00
}
2022-07-15 22:42:23 +00:00
static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
2021-10-21 18:00:05 -07:00
{
	const struct kvm_memory_slot *memslot;
	struct kvm_memslots *slots;
2021-12-06 20:54:32 +01:00
	struct kvm_memslot_iter iter;
2021-10-21 18:00:05 -07:00
	bool flush = false;
	gfn_t start, end;
2021-12-06 20:54:32 +01:00
	int i;
2021-10-21 18:00:05 -07:00
	if (!kvm_memslots_have_rmaps(kvm))
		return flush;
2023-10-27 11:22:04 -07:00
	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
2021-10-21 18:00:05 -07:00
		slots = __kvm_memslots(kvm, i);
2021-12-06 20:54:32 +01:00
		kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
			memslot = iter.slot;
2021-10-21 18:00:05 -07:00
			start = max(gfn_start, memslot->base_gfn);
			end = min(gfn_end, memslot->base_gfn + memslot->npages);
2021-12-06 20:54:32 +01:00
			if (WARN_ON_ONCE(start >= end))
2021-10-21 18:00:05 -07:00
				continue;
2023-02-02 18:27:49 +00:00
			flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
						  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
						  start, end - 1, true, flush);
2021-10-21 18:00:05 -07:00
		}
	}
	return flush;
}
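The range handling in kvm_rmap_zap_gfn_range() can be illustrated in isolation. The following is a minimal userspace sketch, not kernel code: struct slot_range and clamp_to_slot() are hypothetical names introduced here. It shows how a half-open range [gfn_start, gfn_end) is clamped to a slot and converted to the inclusive last gfn that the walker above receives as "end - 1".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for the two memslot fields used by the clamp. */
struct slot_range {
	gfn_t base_gfn;		/* first gfn covered by the slot */
	uint64_t npages;	/* number of 4KiB pages in the slot */
};

/*
 * Clamp the half-open range [gfn_start, gfn_end) to the slot.  Returns false
 * if the two do not overlap.  On success, *first and *last hold inclusive
 * bounds, mirroring the "start, end - 1" arguments in the code above.
 */
static bool clamp_to_slot(const struct slot_range *slot,
			  gfn_t gfn_start, gfn_t gfn_end,
			  gfn_t *first, gfn_t *last)
{
	gfn_t slot_end = slot->base_gfn + slot->npages;
	gfn_t start = gfn_start > slot->base_gfn ? gfn_start : slot->base_gfn;
	gfn_t end = gfn_end < slot_end ? gfn_end : slot_end;

	if (start >= end)
		return false;

	*first = start;
	*last = end - 1;
	return true;
}
```

For a slot covering gfns 0x100 through 0x17f, a request for [0x0, 0x1000) clamps to first = 0x100 and last = 0x17f, while a request starting at 0x180 reports no overlap.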
2021-08-10 23:52:38 +03:00
/*
 * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
 * (not including it)
 */
2015-05-13 14:42:27 +08:00
void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
{
2021-10-21 18:00:05 -07:00
	bool flush;
2015-05-13 14:42:27 +08:00
2021-12-06 20:54:32 +01:00
	if (WARN_ON_ONCE(gfn_end <= gfn_start))
		return;
2021-08-10 23:52:36 +03:00
	write_lock(&kvm->mmu_lock);
2023-10-27 11:21:45 -07:00
	kvm_mmu_invalidate_begin(kvm);
	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
2021-08-10 23:52:39 +03:00
2022-07-15 22:42:23 +00:00
	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
2015-05-13 14:42:27 +08:00
2023-09-21 05:44:56 -04:00
	if (tdp_mmu_enabled)
		flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
2021-08-10 23:52:36 +03:00
	if (flush)
2023-01-26 10:40:22 -08:00
		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
2021-08-10 23:52:36 +03:00
2023-10-27 11:21:45 -07:00
	kvm_mmu_invalidate_end(kvm);
2021-08-10 23:52:39 +03:00
2021-08-10 23:52:36 +03:00
	write_unlock(&kvm->mmu_lock);
2015-05-13 14:42:27 +08:00
}
2015-11-20 17:41:28 +09:00
static bool slot_rmap_write_protect(struct kvm *kvm,
2021-02-12 16:50:05 -08:00
				    struct kvm_rmap_head *rmap_head,
2021-07-12 22:33:38 -04:00
				    const struct kvm_memory_slot *slot)
2015-05-13 14:42:24 +08:00
{
2022-01-19 23:07:23 +00:00
	return rmap_write_protect(rmap_head, false);
2015-05-13 14:42:24 +08:00
}
2015-01-28 10:54:26 +08:00
void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
2021-07-12 22:33:38 -04:00
				      const struct kvm_memory_slot *memslot,
2020-02-27 09:32:27 +08:00
				      int start_level)
2006-12-10 02:21:36 -08:00
{
2021-05-18 10:34:13 -07:00
	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
2023-02-02 18:27:49 +00:00
		walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
				start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
2021-05-18 10:34:13 -07:00
		write_unlock(&kvm->mmu_lock);
	}
2014-04-17 17:06:16 +08:00
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled) {
2021-04-01 16:37:34 -07:00
		read_lock(&kvm->mmu_lock);
kvm: x86: mmu: Always flush TLBs when enabling dirty logging
When A/D bits are not available, KVM uses a software access tracking
mechanism, which involves making the SPTEs inaccessible. However,
the clear_young() MMU notifier does not flush TLBs. So it is possible
that there may still be stale, potentially writable, TLB entries.
This is usually fine, but can be problematic when enabling dirty
logging, because it currently only does a TLB flush if any SPTEs were
modified. But if all SPTEs are in access-tracked state, then there
won't be a TLB flush, which means that the guest could still possibly
write to memory and not have it reflected in the dirty bitmap.
So just unconditionally flush the TLBs when enabling dirty logging.
As an alternative, KVM could explicitly check the MMU-Writable bit when
write-protecting SPTEs to decide if a flush is needed (instead of
checking the Writable bit), but given that a flush almost always happens
anyway, just making it unconditional seems simpler.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20220810224939.2611160-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:49:39 -07:00
		kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
2021-04-01 16:37:34 -07:00
		read_unlock(&kvm->mmu_lock);
	}
2006-12-10 02:21:36 -08:00
}
2007-01-05 16:36:56 -08:00
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting of huge pages that are mapped by
nested MMUs. Walk through the rmap, first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.
Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:
(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.
(2) Splitting a huge page may end up re-using an existing lower level
shadow page table. This is unlike the TDP MMU, which always allocates
new shadow page tables when splitting.
(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.
Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.
This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However, these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:09 -04:00
static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
{
	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
}
static bool need_topup_split_caches_or_resched(struct kvm *kvm)
{
	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
		return true;
	/*
	 * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
	 * to split a single huge page. Calculating how many are actually needed
	 * is possible but not worth the complexity.
	 */
	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
}
static int topup_split_caches(struct kvm *kvm)
{
2022-06-24 17:18:08 +00:00
	/*
	 * Allocating rmap list entries when splitting huge pages for nested
2022-07-12 02:07:24 +00:00
	 * MMUs is uncommon as KVM needs to use a list if and only if there is
2022-06-24 17:18:08 +00:00
	 * more than one rmap entry for a gfn, i.e. requires an L1 gfn to be
2022-07-12 02:07:24 +00:00
	 * aliased by multiple L2 gfns and/or from multiple nested roots with
	 * different roles. Aliasing gfns when using TDP is atypical for VMMs;
	 * a few gfns are often aliased during boot, e.g. when remapping BIOS,
	 * but aliasing rarely occurs post-boot or for many gfns. If there is
	 * only one rmap entry, rmap->val points directly at that one entry and
	 * doesn't need to allocate a list. Buffer the cache by the default
	 * capacity so that KVM doesn't have to drop mmu_lock to topup if KVM
2022-06-24 17:18:08 +00:00
	 * encounters an aliased gfn or two.
	 */
	const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
			     KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
2022-06-22 15:27:09 -04:00
	int r;
	lockdep_assert_held(&kvm->slots_lock);
2022-06-24 17:18:08 +00:00
	r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
2022-06-22 15:27:09 -04:00
					 SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
	if (r)
		return r;
	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
	if (r)
		return r;
	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
}
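The min/capacity distinction used by topup_split_caches() can be sketched in userspace. This is only an illustrative model under stated assumptions: struct obj_cache, cache_needs_topup(), cache_topup(), cache_alloc() and cache_free_all() are hypothetical names, and calloc() stands in for the kernel allocator. The idea shown is that a refill is triggered only when fewer than "min" objects remain, but once triggered it fills to "capacity", so a few extra allocations can be served later without refilling again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define CACHE_MAX 64	/* hypothetical fixed upper bound for this sketch */

struct obj_cache {
	int nobjs;			/* objects currently held */
	void *objects[CACHE_MAX];	/* pre-allocated objects */
};

/* True if fewer than min objects are available. */
static bool cache_needs_topup(const struct obj_cache *c, int min)
{
	return c->nobjs < min;
}

/*
 * Refill to capacity, but only if the cache has dropped below min.
 * Returns 0 if at least min objects are available afterwards, -1 otherwise.
 */
static int cache_topup(struct obj_cache *c, int capacity, int min)
{
	if (capacity > CACHE_MAX || min > capacity)
		return -1;
	if (c->nobjs >= min)
		return 0;

	while (c->nobjs < capacity) {
		void *obj = calloc(1, 64);

		if (!obj)
			return c->nobjs >= min ? 0 : -1;
		c->objects[c->nobjs++] = obj;
	}
	return 0;
}

/* Take one object; the caller now owns it.  Returns NULL if empty. */
static void *cache_alloc(struct obj_cache *c)
{
	return c->nobjs ? c->objects[--c->nobjs] : NULL;
}

/* Release everything still held by the cache. */
static void cache_free_all(struct obj_cache *c)
{
	while (c->nobjs)
		free(c->objects[--c->nobjs]);
}
```

In this model a caller would check cache_needs_topup() while holding its lock, drop the lock to call cache_topup(), and then retry, which is the shape suggested by need_topup_split_caches_or_resched() above.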
static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
{
	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
	struct shadow_page_caches caches = {};
	union kvm_mmu_page_role role;
	unsigned int access;
	gfn_t gfn;
2022-07-12 02:07:22 +00:00
	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
	access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
2022-06-22 15:27:09 -04:00
	/*
	 * Note, huge page splitting always uses direct shadow pages, regardless
	 * of whether the huge page itself is mapped by a direct or indirect
	 * shadow page, since the huge page region itself is being directly
	 * mapped with smaller pages.
	 */
	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
	/* Direct SPs do not require a shadowed_info_cache. */
	caches.page_header_cache = &kvm->arch.split_page_header_cache;
	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
	/* Safe to pass NULL for vCPU since requesting a direct SP. */
	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
}
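Because the replacement shadow page is direct, each of its entries maps a fixed slice of the original huge page. The arithmetic can be sketched as follows, assuming x86-64 paging with 512 entries per table; pages_per_entry() and child_gfn() are hypothetical helpers for illustration and are not the kernel's own functions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define ENTRIES_PER_TABLE 512	/* 9 index bits per paging level on x86-64 */
#define LEVEL_4K 1
#define LEVEL_2M 2
#define LEVEL_1G 3

/* Number of 4KiB frames covered by a single entry at the given level. */
static uint64_t pages_per_entry(int level)
{
	return 1ULL << (9 * (level - 1));
}

/*
 * First gfn mapped by entry "index" of the table that replaces a huge page
 * at "huge_level" whose first frame is "huge_gfn".  Each child entry sits
 * one level below the huge page being split.
 */
static gfn_t child_gfn(gfn_t huge_gfn, int huge_level, int index)
{
	return huge_gfn + (uint64_t)index * pages_per_entry(huge_level - 1);
}
```

Splitting a 1GiB page therefore yields 512 entries that each cover 512 frames (2MiB), and splitting one of those yields 512 entries of one frame each, which matches the two passes described in the commit message.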
static void shadow_mmu_split_huge_page(struct kvm *kvm,
				       const struct kvm_memory_slot *slot,
				       u64 *huge_sptep)
{
	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
	u64 huge_spte = READ_ONCE(*huge_sptep);
	struct kvm_mmu_page *sp;
2022-06-22 15:27:10 -04:00
	bool flush = false;
2022-06-22 15:27:09 -04:00
	u64 *sptep, spte;
	gfn_t gfn;
	int index;
	sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
	for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
		sptep = &sp->spt[index];
		gfn = kvm_mmu_page_get_gfn(sp, index);
		/*
		 * The SP may already have populated SPTEs, e.g. if this huge
		 * page is aliased by multiple sptes with the same access
		 * permissions. These entries are guaranteed to map the same
		 * gfn-to-pfn translation since the SP is direct, so no need to
		 * modify them.
		 *
2022-06-22 15:27:10 -04:00
		 * However, if a given SPTE points to a lower level page table,
		 * that lower level page table may only be partially populated.
		 * Installing such SPTEs would effectively unmap a portion of the
		 * huge page. Unmapping guest memory always requires a TLB flush
		 * since a subsequent operation on the unmapped regions would
		 * fail to detect the need to flush.
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.
Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:
(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.
(2) Splitting a huge page may end up re-using an existing lower level
shadow page tables. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.
(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.
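A minimal sketch of how cases (1) and (3) shape the control flow, assuming a simplified model: toy_try_split() and toy_should_restart() are hypothetical helpers, TOY_MIN_FREE_MMU_PAGES is a placeholder value, and the real code also drops the MMU lock and tops up its caches before reporting -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Placeholder threshold for illustration; an assumption, not KVM's header. */
#define TOY_MIN_FREE_MMU_PAGES 5

/*
 * Hypothetical model of one split attempt:
 *  - too few shadow pages available: refuse with -ENOSPC (case 1)
 *  - allocation caches need a top-up: report -EAGAIN so the caller
 *    restarts its walk after the lock was dropped (case 3)
 *  - otherwise the split proceeds and 0 is returned
 */
static int toy_try_split(unsigned long available_pages, bool caches_need_topup)
{
	if (available_pages <= TOY_MIN_FREE_MMU_PAGES)
		return -ENOSPC;
	if (caches_need_topup)
		return -EAGAIN;
	return 0;
}

/* The walk restarts after a successful split or a retry request. */
static bool toy_should_restart(int r)
{
	assert(r <= 0);
	return r == 0 || r == -EAGAIN;
}
```

Any other error ends the walk for that rmap, which fits the best-effort nature of Eager Page Splitting.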
Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.
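The flush decision for a re-used lower level table can be sketched as follows. This is a toy model, not KVM's SPTE layout: struct toy_spte and toy_split_needs_flush() are hypothetical names, and the two booleans stand in for the present and last-level checks done on real SPTEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an entry in a re-used lower level table. */
struct toy_spte {
	bool present;	/* entry is already populated */
	bool leaf;	/* entry maps memory directly, not another page table */
};

/*
 * Returns true if installing the re-used table requires a TLB flush, i.e.
 * some present entry points at a further page table that may be only
 * partially populated. Empty entries are filled in by the split itself and
 * do not need a flush.
 */
static bool toy_split_needs_flush(const struct toy_spte *table, size_t n)
{
	bool flush = false;
	size_t i;

	assert(table != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (table[i].present)
			flush |= !table[i].leaf;
	}
	return flush;
}
```

A table holding only leaf or empty entries needs no flush; a single present non-leaf entry is enough to require one.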
This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However, these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:09 -04:00
		 */
2022-06-22 15:27:10 -04:00
		if (is_shadow_present_pte(*sptep)) {
			flush |= !is_last_spte(*sptep, sp->role.level);
2022-06-22 15:27:09 -04:00
			continue;
2022-06-22 15:27:10 -04:00
		}
2022-06-22 15:27:09 -04:00
		spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
		mmu_spte_set(sptep, spte);
		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
	}
2022-06-22 15:27:10 -04:00
	__link_shadow_page(kvm, cache, huge_sptep, sp, flush);
2022-06-22 15:27:09 -04:00
}
static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
					  const struct kvm_memory_slot *slot,
					  u64 *huge_sptep)
{
	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
	int level, r = 0;
	gfn_t gfn;
	u64 spte;

	/* Grab information for the tracepoint before dropping the MMU lock. */
2022-07-12 02:07:22 +00:00
	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
2022-06-22 15:27:09 -04:00
	level = huge_sp->role.level;
	spte = *huge_sptep;

	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
		r = -ENOSPC;
		goto out;
	}

	if (need_topup_split_caches_or_resched(kvm)) {
		write_unlock(&kvm->mmu_lock);
		cond_resched();
		/*
		 * If the topup succeeds, return -EAGAIN to indicate that the
		 * rmap iterator should be restarted because the MMU lock was
		 * dropped.
		 */
		r = topup_split_caches(kvm) ?: -EAGAIN;
		write_lock(&kvm->mmu_lock);
		goto out;
	}

	shadow_mmu_split_huge_page(kvm, slot, huge_sptep);

out:
	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
	return r;
}
static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
					    struct kvm_rmap_head *rmap_head,
					    const struct kvm_memory_slot *slot)
{
	struct rmap_iterator iter;
	struct kvm_mmu_page *sp;
	u64 *huge_sptep;
	int r;

restart:
	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
		sp = sptep_to_sp(huge_sptep);

		/* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
		if (WARN_ON_ONCE(!sp->role.guest_mode))
			continue;

		/* The rmaps should never contain non-leaf SPTEs. */
		if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
			continue;

		/* SPs with level >PG_LEVEL_4K should never be unsync. */
		if (WARN_ON_ONCE(sp->unsync))
			continue;

		/* Don't bother splitting huge pages on invalid SPs. */
		if (sp->role.invalid)
			continue;

		r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);

		/*
		 * The split succeeded or needs to be retried because the MMU
		 * lock was dropped. Either way, restart the iterator to get it
		 * back into a consistent state.
		 */
		if (!r || r == -EAGAIN)
			goto restart;

		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
		break;
	}

	return false;
}
static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
						const struct kvm_memory_slot *slot,
						gfn_t start, gfn_t end,
						int target_level)
{
	int level;

	/*
	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
	 * down to the target level. This ensures pages are recursively split
	 * all the way to the target level. There's no need to split pages
	 * already at the target level.
	 */
2023-02-02 18:27:49 +00:00
	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
		__walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
				  level, level, start, end - 1, true, false);
2022-06-22 15:27:09 -04:00
}
2022-01-19 23:07:37 +00:00
/* Must be called with the mmu_lock held in write-mode. */
void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
				  const struct kvm_memory_slot *memslot,
				  u64 start, u64 end,
				  int target_level)
{
2022-09-21 10:35:37 -07:00
	if (!tdp_mmu_enabled)
2022-06-22 15:27:09 -04:00
		return;

	if (kvm_memslots_have_rmaps(kvm))
		kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);

	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
2022-01-19 23:07:37 +00:00
	/*
2024-01-02 18:40:11 -06:00
	 * A TLB flush is unnecessary at this point for the same reasons as in
2022-01-19 23:07:37 +00:00
	 * kvm_mmu_slot_try_split_huge_pages().
	 */
}
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KiB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failure to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPU execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
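The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical and the inputs are simply the numbers quoted above; nothing here measures anything.

```c
#include <assert.h>

/*
 * CPU-seconds spent splitting on fault: the per-vCPU time that eager
 * splitting removed, multiplied by the number of vCPU threads.
 */
static double fault_split_cpu_seconds(double before_s, double after_s, int vcpus)
{
	return (before_s - after_s) * vcpus;
}

/* How many times cheaper eager splitting is than splitting on fault. */
static double split_speedup(double fault_cpu_s, double eager_cpu_s)
{
	assert(eager_cpu_s > 0.0);
	return fault_cpu_s / eager_cpu_s;
}
```

With 2.94s, 0.14s and 96 vCPUs the first helper gives 268.8 CPU-seconds (the "~270" above), and dividing by 1.55 CPU-seconds gives a ratio of roughly 173, consistent with "over 100x".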
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has to split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
2022-01-19 23:07:37 +00:00
				       const struct kvm_memory_slot *memslot,
				       int target_level)
2022-01-19 23:07:36 +00:00
{
	u64 start = memslot->base_gfn;
	u64 end = start + memslot->npages;
2022-09-21 10:35:37 -07:00
	if (!tdp_mmu_enabled)
2022-06-22 15:27:09 -04:00
		return;

	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
		write_unlock(&kvm->mmu_lock);
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has to split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
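The CPU-time figures quoted above can be checked with a few lines of
arithmetic. The sketch below is illustrative only: the helper names are
hypothetical and are not part of KVM, and the inputs are the numbers
reported in this commit message.

```c
#include <assert.h>

/*
 * CPU-seconds attributed to splitting on fault: the wall-clock time saved
 * per vCPU multiplied by the number of vCPU threads that paid that cost.
 */
static double fault_split_cpu_seconds(double before_s, double after_s,
				      int nr_vcpus)
{
	assert(nr_vcpus > 0 && before_s >= after_s);
	return (before_s - after_s) * nr_vcpus;
}

/* Ratio of CPU time between the on-fault path and the eager path. */
static double split_speedup(double fault_cpu_s, double eager_cpu_s)
{
	assert(eager_cpu_s > 0.0);
	return fault_cpu_s / eager_cpu_s;
}
```

With the reported values (2.94s, 0.14s, 96 vCPUs, 1.55 CPU-seconds), the
first helper yields roughly 268.8 CPU-seconds, which the message rounds to
~270, and the ratio comes out well above the "over 100x" that is claimed.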
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
	}
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.
Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:
(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.
(2) Splitting a huge page may end up re-using an existing lower level
shadow page table. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.
(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.
Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.
This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:09 -04:00
	read_lock(&kvm->mmu_lock);
	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
	read_unlock(&kvm->mmu_lock);
2022-01-19 23:07:36 +00:00
	/*
	 * No TLB flush is necessary here. KVM will flush TLBs after
	 * write-protecting and/or clearing dirty on the newly split SPTEs to
	 * ensure that guest writes are reflected in the dirty log before the
	 * ioctl to enable dirty logging on this memslot completes. Since the
	 * split SPTEs retain the write and dirty bits of the huge SPTE, it is
	 * safe for KVM to decide if a TLB flush is necessary based on the split
	 * SPTEs.
	 */
}
2015-04-03 15:40:25 +08:00
static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
2021-02-12 16:50:05 -08:00
					 struct kvm_rmap_head *rmap_head,
2021-07-12 22:33:38 -04:00
					 const struct kvm_memory_slot *slot)
2015-04-03 15:40:25 +08:00
{
	u64 *sptep;
	struct rmap_iterator iter;
	int need_tlb_flush = 0;
	struct kvm_mmu_page *sp;
2015-05-13 14:42:20 +08:00
restart:
2015-11-20 17:41:28 +09:00
	for_each_rmap_spte(rmap_head, &iter, sptep) {
2020-06-22 13:20:33 -07:00
		sp = sptep_to_sp(sptep);
2015-04-03 15:40:25 +08:00
		/*
2015-04-14 12:04:10 +08:00
		 * We cannot do huge page mapping for indirect shadow pages,
		 * which are found on the last rmap (level = 1) when not using
		 * tdp; such shadow pages are synced with the page table in
		 * the guest, and the guest page table is using 4K page size
		 * mapping if the indirect sp has level = 1.
2015-04-03 15:40:25 +08:00
		 */
2022-04-29 01:04:16 +00:00
		if (sp->role.direct &&
2021-02-12 16:50:06 -08:00
		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
2022-07-15 23:21:04 +00:00
							       PG_LEVEL_NUM)) {
2022-07-15 22:42:25 +00:00
			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
2018-12-06 21:21:08 +08:00
2023-04-04 17:31:32 -07:00
			if (kvm_available_flush_remote_tlbs_range())
2022-10-10 20:19:15 +08:00
				kvm_flush_remote_tlbs_sptep(kvm, sptep);
2018-12-06 21:21:08 +08:00
			else
				need_tlb_flush = 1;
2015-05-13 14:42:20 +08:00
			goto restart;
		}
2015-04-03 15:40:25 +08:00
	}
	return need_tlb_flush;
}
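The `goto restart` in the function above appears to be there because
zapping an SPTE modifies the rmap being walked, which leaves the iterator
stale. A minimal userspace sketch of the same restart-after-removal idiom
follows. It assumes a plain singly linked list, and every name in it
(`demo_node`, `demo_unlink`, `demo_zap_below`) is hypothetical and
unrelated to KVM's real rmap structures.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	int level;
	struct demo_node *next;
};

/* Unlink @victim from the list at @head; any cursor on it becomes stale. */
static void demo_unlink(struct demo_node **head, struct demo_node *victim)
{
	struct demo_node **pp = head;

	while (*pp && *pp != victim)
		pp = &(*pp)->next;
	assert(*pp == victim);
	*pp = victim->next;
	victim->next = NULL;
}

/* Remove every node below @max_level; returns how many were removed. */
static int demo_zap_below(struct demo_node **head, int max_level)
{
	struct demo_node *n;
	int zapped = 0;

restart:
	for (n = *head; n; n = n->next) {
		if (n->level < max_level) {
			demo_unlink(head, n);
			zapped++;
			/* n->next is no longer valid, so walk from the top. */
			goto restart;
		}
	}
	return zapped;
}
```

Restarting costs extra iterations in the worst case, but it keeps the walk
correct without needing a removal-safe iterator.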
2022-06-22 15:27:06 -04:00
static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
					   const struct kvm_memory_slot *slot)
{
	/*
	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
	 * pages that are already mapped at the maximum hugepage level.
	 */
2023-02-02 18:27:49 +00:00
	if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte,
			    PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
2023-08-11 04:51:19 +00:00
		kvm_flush_remote_tlbs_memslot(kvm, slot);
2022-06-22 15:27:06 -04:00
}
2015-04-03 15:40:25 +08:00
void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
2021-07-12 22:33:38 -04:00
				   const struct kvm_memory_slot *slot)
2015-04-03 15:40:25 +08:00
{
2021-05-18 10:34:13 -07:00
	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
2022-06-22 15:27:06 -04:00
		kvm_rmap_zap_collapsible_sptes(kvm, slot);
2021-05-18 10:34:13 -07:00
		write_unlock(&kvm->mmu_lock);
	}
2021-04-01 16:37:33 -07:00
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled) {
2021-04-01 16:37:33 -07:00
		read_lock(&kvm->mmu_lock);
2021-11-20 04:50:21 +00:00
		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
2021-04-01 16:37:33 -07:00
		read_unlock(&kvm->mmu_lock);
	}
2015-04-03 15:40:25 +08:00
}
2015-01-28 10:54:24 +08:00
void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
2021-07-12 22:33:38 -04:00
				   const struct kvm_memory_slot *memslot)
2015-01-28 10:54:24 +08:00
{
2021-05-18 10:34:13 -07:00
	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
2021-10-19 16:22:23 +00:00
		/*
		 * Clear dirty bits only on 4k SPTEs since the legacy MMU only
		 * supports dirty logging at a 4k granularity.
		 */
2023-02-02 18:27:49 +00:00
		walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false);
2021-05-18 10:34:13 -07:00
		write_unlock(&kvm->mmu_lock);
	}
2015-01-28 10:54:24 +08:00
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled) {
2021-04-01 16:37:34 -07:00
		read_lock(&kvm->mmu_lock);
kvm: x86: mmu: Always flush TLBs when enabling dirty logging
When A/D bits are not available, KVM uses a software access tracking
mechanism, which involves making the SPTEs inaccessible. However,
the clear_young() MMU notifier does not flush TLBs. So it is possible
that there may still be stale, potentially writable, TLB entries.
This is usually fine, but can be problematic when enabling dirty
logging, because it currently only does a TLB flush if any SPTEs were
modified. But if all SPTEs are in access-tracked state, then there
won't be a TLB flush, which means that the guest could still possibly
write to memory and not have it reflected in the dirty bitmap.
So just unconditionally flush the TLBs when enabling dirty logging.
As an alternative, KVM could explicitly check the MMU-Writable bit when
write-protecting SPTEs to decide if a flush is needed (instead of
checking the Writable bit), but given that a flush almost always happens
anyway, just making it unconditional seems simpler.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20220810224939.2611160-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:49:39 -07:00
		kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
2021-04-01 16:37:34 -07:00
		read_unlock(&kvm->mmu_lock);
	}
2015-01-28 10:54:24 +08:00
	/*
2022-08-10 15:49:39 -07:00
	 * The caller will flush the TLBs after this function returns.
	 *
2015-01-28 10:54:24 +08:00
	 * It's also safe to flush TLBs out of mmu lock here as currently this
	 * function is only used for dirty logging, in which case flushing TLB
	 * out of mmu lock also guarantees no dirty pages will be lost in
	 * dirty_bitmap.
	 */
}
2023-07-28 18:35:18 -07:00
static void kvm_mmu_zap_all(struct kvm *kvm)
2013-05-31 08:36:22 +08:00
{
	struct kvm_mmu_page *sp, *node;
2019-02-05 13:01:31 -08:00
	LIST_HEAD(invalid_list);
KVM: x86/mmu: Differentiate between nr zapped and list unstable
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information. 1) was at
least one page zapped and 2) has the list of MMU pages become unstable.
In its original incarnation (as kvm_mmu_zap_page()), there was no
return value at all. Commit 0738541396be ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children. Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.
Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped. This was not intentional, it was simply a
side effect of how the code was written.
The unintended side effect was then morphed into an actual feature by
commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.
Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect. Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.
Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter. Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.
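The split described above, a boolean "list may be unstable" return value
plus a separate out-parameter for the count, can be sketched in isolation.
This is a simplified userspace illustration with hypothetical names
(`demo_page`, `demo_prepare_zap_counted`, `demo_prepare_zap`). It does not
reproduce KVM's actual helpers, and having the wrapper return the count is
an illustrative choice rather than a statement about the real function.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_page {
	int nr_children;	/* hypothetical: children zapped with the page */
};

/*
 * Helper: report the number of zapped pages through @nr_zapped and return
 * true only when children were zapped, i.e. when a caller walking a list of
 * pages can no longer trust its cursor and must restart.
 */
static bool demo_prepare_zap_counted(const struct demo_page *page,
				     int *nr_zapped)
{
	assert(page && nr_zapped);
	*nr_zapped = page->nr_children + 1;	/* children plus the page */
	return page->nr_children > 0;
}

/* Wrapper for callers that only want the count and never walk a list. */
static int demo_prepare_zap(const struct demo_page *page)
{
	int nr_zapped;

	demo_prepare_zap_counted(page, &nr_zapped);
	return nr_zapped;
}
```

Zapping a leaf page reports one page zapped but does not request a
restart, which is exactly the case the overloaded return value used to get
wrong.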
Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 13:01:35 -08:00
	int ign;
2013-05-31 08:36:22 +08:00
2021-02-02 10:57:24 -08:00
	write_lock(&kvm->mmu_lock);
2013-05-31 08:36:22 +08:00
restart:
2019-02-05 13:01:32 -08:00
	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
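The behavior this conversion relies on, warn once per call site while
still returning the condition on every call, can be modeled in userspace.
The sketch below is an assumption-laden illustration: `demo_warn_site`,
`demo_warn_on_once()` and `demo_warnings_emitted` are hypothetical names,
and the kernel's WARN_ON_ONCE() is a macro with a hidden per-site flag
rather than a function taking an explicit one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings so the rate limiting can be observed. */
static int demo_warnings_emitted;

/* One instance per call site; stands in for the macro's hidden flag. */
struct demo_warn_site {
	bool fired;
};

/*
 * Log at most once per site, but always hand the condition back so the
 * caller can still take its recovery path on every failure.
 */
static bool demo_warn_on_once(struct demo_warn_site *site, bool cond,
			      const char *what)
{
	assert(site != NULL);
	if (cond && !site->fired) {
		site->fired = true;
		demo_warnings_emitted++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}
```

A loop that hits the same failing assertion a thousand times would log a
single line, while each iteration still sees the condition as true and can
skip the bad entry.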
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 17:47:17 -07:00
		if (WARN_ON_ONCE(sp->role.invalid))
2019-02-05 13:01:23 -08:00
			continue;
2019-09-12 19:46:04 -07:00
		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
2013-05-31 08:36:22 +08:00
			goto restart;
2021-02-02 10:57:24 -08:00
		if (cond_resched_rwlock_write(&kvm->mmu_lock))
2013-05-31 08:36:22 +08:00
			goto restart;
	}
2019-02-05 13:01:23 -08:00
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
2020-10-14 11:26:47 -07:00
2022-09-21 10:35:37 -07:00
	if (tdp_mmu_enabled)
2020-10-14 11:26:47 -07:00
		kvm_tdp_mmu_zap_all(kvm);
2021-02-02 10:57:24 -08:00
	write_unlock(&kvm->mmu_lock);
2013-05-31 08:36:22 +08:00
}
2023-07-28 18:35:18 -07:00
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
	kvm_mmu_zap_all(kvm);
}
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
{
2023-07-28 18:35:19 -07:00
	kvm_mmu_zap_all_fast(kvm);
2023-07-28 18:35:18 -07:00
}
2019-02-05 12:54:17 -08:00
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
2013-06-07 16:51:26 +08:00
{
2023-07-28 17:47:17 -07:00
	WARN_ON_ONCE(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
2019-02-05 13:01:12 -08:00
2019-02-05 13:01:18 -08:00
	gen &= MMIO_SPTE_GEN_MASK;
2019-02-05 13:01:12 -08:00
2013-06-07 16:51:26 +08:00
	/*
2019-02-05 13:01:12 -08:00
	 * Generation numbers are incremented in multiples of the number of
	 * address spaces in order to provide unique generations across all
	 * address spaces. Strip what is effectively the address space
	 * modifier prior to checking for a wrap of the MMIO generation so
	 * that a wrap in any address space is detected.
	 */
2023-10-27 11:22:04 -07:00
	gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
2019-02-05 13:01:12 -08:00
2013-06-07 16:51:26 +08:00
	/*
2019-02-05 13:01:12 -08:00
	 * The very rare case: if the MMIO generation number has wrapped,
2013-06-07 16:51:26 +08:00
	 * zap all shadow pages.
	 */
2019-02-05 13:01:12 -08:00
	if (unlikely(gen == 0)) {
KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 23:09:18 +00:00
		kvm_debug_ratelimited("zapping shadow pages for mmio generation wraparound\n");
2019-09-12 19:46:04 -07:00
		kvm_mmu_zap_all_fast(kvm);
2013-06-21 01:34:31 +09:00
	}
2013-06-07 16:51:26 +08:00
}
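The wrap check in kvm_mmu_invalidate_mmio_sptes() comes down to two
masking steps followed by a comparison with zero. The standalone sketch
below mirrors that arithmetic under stated assumptions: `DEMO_GEN_MASK` is
a hypothetical 19-bit mask chosen for illustration (the real
MMIO_SPTE_GEN_MASK is defined elsewhere and is not shown here), and the
number of address spaces is assumed to be a power of two so that
`nr_as - 1` can serve as a bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generation width, for illustration only. */
#define DEMO_GEN_MASK ((UINT64_C(1) << 19) - 1)

/*
 * Return true when the MMIO generation has wrapped. The low bits that
 * encode the address space are stripped first, so a wrap is detected no
 * matter which address space triggered the update.
 */
static bool demo_mmio_gen_wrapped(uint64_t gen, uint64_t nr_as)
{
	/* nr_as must be a non-zero power of two for the mask to be valid. */
	assert(nr_as && (nr_as & (nr_as - 1)) == 0);

	gen &= DEMO_GEN_MASK;
	gen &= ~(nr_as - 1);
	return gen == 0;
}
```

With two address spaces, generations 0 and 1 both count as a wrap because
they differ only in the address-space bit, while generation 2 does not.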
2023-02-02 18:27:51 +00:00
static unsigned long mmu_shrink_scan(struct shrinker *shrink,
				     struct shrink_control *sc)
2008-03-30 15:17:21 +03:00
{
	struct kvm *kvm;
2011-05-24 17:12:27 -07:00
	int nr_to_scan = sc->nr_to_scan;
2013-08-28 10:18:14 +10:00
	unsigned long freed = 0;
2008-03-30 15:17:21 +03:00
2019-01-03 17:14:28 -08:00
	mutex_lock(&kvm_lock);
2008-03-30 15:17:21 +03:00
	list_for_each_entry(kvm, &vm_list, vm_list) {
2011-12-02 18:35:24 +01:00
		int idx;
2010-06-04 21:55:29 +08:00
		LIST_HEAD(invalid_list);
2008-03-30 15:17:21 +03:00
2012-08-20 18:35:39 +09:00
		/*
		 * Never scan more than sc->nr_to_scan VM instances.
		 * Will not hit this condition practically since we do not try
		 * to shrink more than one VM and it is very unlikely to see
		 * !n_used_mmu_pages so many times.
		 */
		if (!nr_to_scan--)
			break;
2012-06-04 14:53:23 +03:00
		/*
		 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
		 * here. We may skip a VM instance erroneously, but we do not
		 * want to shrink a VM that only started to populate its MMU
		 * anyway.
		 */
2019-09-12 19:46:10 -07:00
		if (!kvm->arch.n_used_mmu_pages &&
		    !kvm_has_zapped_obsolete_pages(kvm))
2012-06-04 14:53:23 +03:00
			continue;
2009-12-23 14:35:25 -02:00
		idx = srcu_read_lock(&kvm->srcu);
2021-02-02 10:57:24 -08:00
		write_lock(&kvm->mmu_lock);
2008-03-30 15:17:21 +03:00
2019-09-12 19:46:10 -07:00
		if (kvm_has_zapped_obsolete_pages(kvm)) {
			kvm_mmu_commit_zap_page(kvm,
				&kvm->arch.zapped_obsolete_pages);
			goto unlock;
		}
2020-06-23 12:35:41 -07:00
		freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
2012-06-04 14:53:23 +03:00
2019-09-12 19:46:10 -07:00
unlock:
2021-02-02 10:57:24 -08:00
		write_unlock(&kvm->mmu_lock);
2009-12-23 14:35:25 -02:00
		srcu_read_unlock(&kvm->srcu, idx);
2012-06-04 14:53:23 +03:00
2013-08-28 10:18:14 +10:00
		/*
		 * unfair on small ones
		 * per-vm shrinkers cry out
		 * sadness comes quickly
		 */
2012-06-04 14:53:23 +03:00
		list_move_tail(&kvm->vm_list, &vm_list);
		break;
2008-03-30 15:17:21 +03:00
	}
2019-01-03 17:14:28 -08:00
	mutex_unlock(&kvm_lock);
2013-08-28 10:18:14 +10:00
	return freed;
}
2023-02-02 18:27:51 +00:00
static unsigned long mmu_shrink_count(struct shrinker *shrink,
				      struct shrink_control *sc)
2013-08-28 10:18:14 +10:00
{
KVM: create aggregate kvm_total_used_mmu_pages value
Of slab shrinkers, the VM code says:
* Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
* querying the cache size, so a fastpath for that case is appropriate.
and it *means* it. Look at how it calls the shrinkers:
nr_before = (*shrinker->shrink)(0, gfp_mask);
shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
So, if you do anything stupid in your shrinker, the VM will doubly
punish you.
The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm->mmu_lock in sequence. If we have 100 VMs, then
we're going to take 101 locks. We do it twice, so each call takes
202 locks. If we're under memory pressure, we can have each cpu
trying to do this. It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.
This is guaranteed to optimize at least half of those lock
acquisitions away. It removes the need to take any of the locks
when simply trying to count objects.
A 'percpu_counter' can be a large object, but we only have one
of these for the entire system. There are not any better
alternatives at the moment, especially ones that handle CPU
hotplug.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-08-19 18:11:37 -07:00
	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
2008-03-30 15:17:21 +03:00
}
2023-09-11 17:44:01 +08:00
static struct shrinker *mmu_shrinker;
2008-03-30 15:17:21 +03:00
2008-05-22 10:37:48 +02:00
static void mmu_destroy_caches(void)
2007-04-15 16:31:09 +03:00
{
2017-10-07 23:15:23 -04:00
	kmem_cache_destroy(pte_list_desc_cache);
	kmem_cache_destroy(mmu_page_header_cache);
2007-04-15 16:31:09 +03:00
}
2023-06-01 17:58:59 -07:00
static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
{
	if (nx_hugepage_mitigation_hard_disabled)
2023-06-25 15:34:38 +08:00
		return sysfs_emit(buffer, "never\n");
2023-06-01 17:58:59 -07:00
	return param_get_bool(buffer, kp);
}
2019-11-04 12:22:02 +01:00
static bool get_nx_auto_mode(void)
{
	/* Return true when CPU has the bug, and mitigations are ON */
	return boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT) && !cpu_mitigations_off();
}
static void __set_nx_huge_pages(bool val)
{
	nx_huge_pages = itlb_multihit_kvm_mitigation = val;
}
static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
{
	bool old_val = nx_huge_pages;
	bool new_val;
2023-06-01 17:58:59 -07:00
	if (nx_hugepage_mitigation_hard_disabled)
		return -EPERM;
2019-11-04 12:22:02 +01:00
	/* In "auto" mode deploy workaround only if CPU has the bug. */
2023-06-01 17:58:59 -07:00
	if (sysfs_streq(val, "off")) {
2019-11-04 12:22:02 +01:00
		new_val = 0;
2023-06-01 17:58:59 -07:00
	} else if (sysfs_streq(val, "force")) {
2019-11-04 12:22:02 +01:00
		new_val = 1;
2023-06-01 17:58:59 -07:00
	} else if (sysfs_streq(val, "auto")) {
2019-11-04 12:22:02 +01:00
		new_val = get_nx_auto_mode();
2023-06-01 17:58:59 -07:00
	} else if (sysfs_streq(val, "never")) {
		new_val = 0;
		mutex_lock(&kvm_lock);
		if (!list_empty(&vm_list)) {
			mutex_unlock(&kvm_lock);
			return -EBUSY;
		}
		nx_hugepage_mitigation_hard_disabled = true;
		mutex_unlock(&kvm_lock);
	} else if (kstrtobool(val, &new_val) < 0) {
2019-11-04 12:22:02 +01:00
		return -EINVAL;
2023-06-01 17:58:59 -07:00
	}
2019-11-04 12:22:02 +01:00
	__set_nx_huge_pages(new_val);
	if (new_val != old_val) {
		struct kvm *kvm;
		mutex_lock(&kvm_lock);
		list_for_each_entry(kvm, &vm_list, vm_list) {
KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()
Acquire the per-VM slots_lock when zapping all shadow pages as part of
toggling nx_huge_pages. The fast zap algorithm relies on exclusivity
(via slots_lock) to identify obsolete vs. valid shadow pages, because it
uses a single bit for its generation number. Holding slots_lock also
obviates the need to acquire a read lock on the VM's srcu.
Failing to take slots_lock when toggling nx_huge_pages allows multiple
instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
(kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
to enforce exclusivity).
Concurrent fast zap instances cause obsolete shadow pages to be
incorrectly identified as valid due to the single bit generation number
wrapping, which results in stale shadow pages being left in KVM's MMU
and leads to all sorts of undesirable behavior.
The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
toggling nx_huge_pages via its module param.
Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
an ulong-sized generation instead of relying on exclusivity for
correctness, but all callers except the recently added set_nx_huge_pages()
needed to hold slots_lock anyways. Therefore, this patch does not have
to be backported to stable kernels.
Given that toggling nx_huge_pages is by no means a fast path, force it
to conform to the current approach instead of reintroducing the previous
generation count.
Fixes: b8e8c8303ff28 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13 11:30:32 -08:00
			mutex_lock(&kvm->slots_lock);
2019-11-04 12:22:02 +01:00
			kvm_mmu_zap_all_fast(kvm);
2019-11-13 11:30:32 -08:00
			mutex_unlock(&kvm->slots_lock);
2019-11-04 20:26:00 +01:00
2022-10-19 16:56:12 +00:00
			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
2019-11-04 12:22:02 +01:00
		}
		mutex_unlock(&kvm_lock);
	}
	return 0;
}
KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded
Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.
Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.
In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded. Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.
Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.
=========================================================================
UBSAN: invalid-load in kernel/params.c:320:33
load of value 255 is not a valid value for type '_Bool'
CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
ubsan_epilogue+0x5/0x40
__ubsan_handle_load_invalid_value.cold+0x43/0x48
param_get_bool.cold+0xf/0x14
param_attr_show+0x55/0x80
module_attr_show+0x1c/0x30
sysfs_kf_seq_show+0x93/0xc0
seq_read_iter+0x11c/0x450
new_sync_read+0x11b/0x1a0
vfs_read+0xf0/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
=========================================================================
Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
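The idea of resolving the "auto" sentinel once at load time can be sketched in isolation as follows. This is a minimal model, not KVM's code: toy_resolve(), toy_param_show() and the cpu_affected input are hypothetical, and the getter only imitates the Y/N output of a boolean module parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Sentinel for "auto"; the raw parameter is an int so -1 is representable. */
#define TOY_AUTO (-1)

/*
 * Hypothetical policy: "auto" enables the mitigation only on affected CPUs.
 * Explicit 0/1 settings are passed through unchanged.
 */
static int toy_resolve(int raw, bool cpu_affected)
{
	int resolved = (raw == TOY_AUTO) ? (cpu_affected ? 1 : 0) : raw;

	/* After resolution the value must be a proper boolean. */
	assert(resolved == 0 || resolved == 1);
	return resolved;
}

/* Imitates a boolean getter: any non-zero value prints as 'Y'. */
static char toy_param_show(int value)
{
	return value ? 'Y' : 'N';
}
```

Showing the unresolved sentinel yields 'Y' regardless of the eventual policy decision, while showing the resolved value yields 'N' on a CPU that the hypothetical policy treats as unaffected.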
2022-03-31 22:13:59 +00:00
/*
 * nx_huge_pages needs to be resolved to true/false when kvm.ko is loaded, as
 * its default value of -1 is technically undefined behavior for a boolean.
2022-08-03 22:49:56 +00:00
 * Forward the module init call to SPTE code so that it too can handle module
 * params that need to be resolved/snapshot.
2022-03-31 22:13:59 +00:00
 */
2022-08-03 22:49:55 +00:00
void __init kvm_mmu_x86_module_init(void)
2007-04-15 16:31:09 +03:00
{
2019-11-04 12:22:02 +01:00
	if (nx_huge_pages == -1)
		__set_nx_huge_pages(get_nx_auto_mode());
2022-08-03 22:49:56 +00:00
2022-09-21 10:35:37 -07:00
	/*
	 * Snapshot userspace's desire to enable the TDP MMU. Whether or not the
	 * TDP MMU is actually enabled is determined in kvm_configure_mmu()
	 * when the vendor module is loaded.
	 */
	tdp_mmu_allowed = tdp_mmu_enabled;
2022-08-03 22:49:56 +00:00
	kvm_mmu_spte_module_init();
2022-03-31 22:13:59 +00:00
}
/*
 * The bulk of the MMU initialization is deferred until the vendor module is
 * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
 * to be reset when a potentially different vendor module is loaded.
 */
int kvm_mmu_vendor_module_init(void)
{
	int ret = -ENOMEM;
2019-11-04 12:22:02 +01:00
2018-10-08 21:28:10 +02:00
	/*
	 * MMU roles use union aliasing which is, generally speaking, an
	 * undefined behavior. However, we supposedly know how compilers behave
	 * and the current status quo is unlikely to change. Guardians below are
	 * supposed to let us know if the assumption becomes false.
	 */
	BUILD_BUG_ON(sizeof(union kvm_mmu_page_role) != sizeof(u32));
	BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
2022-02-10 07:38:32 -05:00
	BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
2018-10-08 21:28:10 +02:00
2018-08-14 10:15:34 -07:00
	kvm_mmu_reset_all_pte_masks();
2016-12-06 16:46:16 -08:00
2024-01-16 18:00:25 +08:00
	pte_list_desc_cache = KMEM_CACHE(pte_list_desc, SLAB_ACCOUNT);
2011-05-15 23:26:20 +08:00
	if (!pte_list_desc_cache)
2018-01-10 17:26:59 +01:00
		goto out;
2007-04-15 16:31:09 +03:00
2007-05-30 12:34:53 +03:00
	mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
						  sizeof(struct kvm_mmu_page),
2017-10-05 18:07:24 -07:00
						  0, SLAB_ACCOUNT, NULL);
2007-05-30 12:34:53 +03:00
	if (!mmu_page_header_cache)
2018-01-10 17:26:59 +01:00
		goto out;
2007-05-30 12:34:53 +03:00
2014-09-08 09:51:29 +09:00
	if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
2018-01-10 17:26:59 +01:00
		goto out;
2010-08-23 16:13:15 +08:00
2023-09-11 17:44:01 +08:00
	mmu_shrinker = shrinker_alloc(0, "x86-mmu");
	if (!mmu_shrinker)
2022-08-23 14:32:37 +08:00
		goto out_shrinker;
2008-03-30 15:17:21 +03:00
2023-09-11 17:44:01 +08:00
	mmu_shrinker->count_objects = mmu_shrink_count;
	mmu_shrinker->scan_objects = mmu_shrink_scan;
	mmu_shrinker->seeks = DEFAULT_SEEKS * 10;
	shrinker_register(mmu_shrinker);
2007-04-15 16:31:09 +03:00
	return 0;
2022-08-23 14:32:37 +08:00
out_shrinker:
	percpu_counter_destroy(&kvm_total_used_mmu_pages);
2018-01-10 17:26:59 +01:00
out:
2008-03-30 15:17:21 +03:00
	mmu_destroy_caches();
2018-01-10 17:26:59 +01:00
	return ret;
2007-04-15 16:31:09 +03:00
}
2010-09-27 18:07:07 +08:00
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
2013-10-02 16:56:12 +02:00
	kvm_mmu_unload(vcpu);
2019-06-22 19:42:04 +02:00
	free_mmu_pages(&vcpu->arch.root_mmu);
	free_mmu_pages(&vcpu->arch.guest_mmu);
2010-09-27 18:07:07 +08:00
	mmu_free_memory_caches(vcpu);
2010-12-23 16:08:35 +08:00
}
2022-03-31 22:13:59 +00:00
void kvm_mmu_vendor_module_exit(void)
2010-12-23 16:08:35 +08:00
{
	mmu_destroy_caches();
	percpu_counter_destroy(&kvm_total_used_mmu_pages);
2023-09-11 17:44:01 +08:00
	shrinker_free(mmu_shrinker);
2010-09-27 18:07:07 +08:00
}
2019-11-04 20:26:00 +01:00
2021-11-20 01:57:06 +00:00
/*
 * Calculate the effective recovery period, accounting for '0' meaning "let KVM
 * select a halving time of 1 hour". Returns true if recovery is enabled.
 */
static bool calc_nx_huge_pages_recovery_period(uint *period)
{
	/*
	 * Use READ_ONCE to get the params, this may be called outside of the
	 * param setters, e.g. by the kthread to compute its next timeout.
	 */
	bool enabled = READ_ONCE(nx_huge_pages);
	uint ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
	if (!enabled || !ratio)
		return false;
	*period = READ_ONCE(nx_huge_pages_recovery_period_ms);
	if (!*period) {
		/* Make sure the period is not less than one second. */
		ratio = min(ratio, 3600u);
		*period = 60 * 60 * 1000 / ratio;
	}
	return true;
}
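The arithmetic of the fallback branch can be checked outside the kernel with a small userspace model. This is only a sketch: toy_calc_period() and toy_period_or_zero() are hypothetical names, and the three module parameters are passed in as plain arguments instead of being read with READ_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the period calculation. A period_ms of 0 means
 * "derive the period from the ratio so that one hour spans `ratio` passes".
 */
static bool toy_calc_period(bool enabled, unsigned int ratio,
			    unsigned int period_ms, unsigned int *period)
{
	assert(period != NULL);

	if (!enabled || !ratio)
		return false;

	*period = period_ms;
	if (!*period) {
		/* Cap the ratio so the period never drops below one second. */
		if (ratio > 3600u)
			ratio = 3600u;
		*period = 60 * 60 * 1000 / ratio;
	}
	return true;
}

/* Convenience wrapper for quick checks: returns 0 when recovery is disabled. */
static unsigned int toy_period_or_zero(bool enabled, unsigned int ratio,
				       unsigned int period_ms)
{
	unsigned int period = 0;

	return toy_calc_period(enabled, ratio, period_ms, &period) ? period : 0;
}
```

In this model a ratio of 60 with no explicit period gives 60000 ms (one pass per minute), any ratio above 3600 is clamped to a 1000 ms period, and an explicit non-zero period is returned unchanged.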
2021-10-19 18:06:27 -07:00
static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp)
2019-11-04 20:26:00 +01:00
{
2021-10-19 18:06:27 -07:00
	bool was_recovery_enabled, is_recovery_enabled;
	uint old_period, new_period;
2019-11-04 20:26:00 +01:00
	int err;
2023-06-01 17:58:59 -07:00
	if (nx_hugepage_mitigation_hard_disabled)
		return -EPERM;
2021-11-20 01:57:06 +00:00
	was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
2021-10-19 18:06:27 -07:00
2019-11-04 20:26:00 +01:00
	err = param_set_uint(val, kp);
	if (err)
		return err;
2021-11-20 01:57:06 +00:00
	is_recovery_enabled = calc_nx_huge_pages_recovery_period(&new_period);
2021-10-19 18:06:27 -07:00
2021-11-20 01:57:06 +00:00
	if (is_recovery_enabled &&
2021-10-19 18:06:27 -07:00
	    (!was_recovery_enabled || old_period > new_period)) {
2019-11-04 20:26:00 +01:00
		struct kvm *kvm;
		mutex_lock(&kvm_lock);
		list_for_each_entry(kvm, &vm_list, vm_list)
2022-10-19 16:56:12 +00:00
			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
2019-11-04 20:26:00 +01:00
		mutex_unlock(&kvm_lock);
	}
	return err;
}
2022-10-19 16:56:12 +00:00
static void kvm_recover_nx_huge_pages(struct kvm *kvm)
2019-11-04 20:26:00 +01:00
{
2021-06-15 09:29:05 -07:00
	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
2022-11-03 13:44:21 -07:00
	struct kvm_memory_slot *slot;
2019-11-04 20:26:00 +01:00
	int rcu_idx;
	struct kvm_mmu_page *sp;
	unsigned int ratio;
	LIST_HEAD(invalid_list);
2021-03-25 13:01:18 -07:00
	bool flush = false;
2019-11-04 20:26:00 +01:00
	ulong to_zap;
	rcu_idx = srcu_read_lock(&kvm->srcu);
2021-02-02 10:57:24 -08:00
	write_lock(&kvm->mmu_lock);
2019-11-04 20:26:00 +01:00
2022-02-26 00:15:37 +00:00
	/*
	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
	 * be done under RCU protection, because the pages are freed via RCU
	 * callback.
	 */
	rcu_read_lock();
2019-11-04 20:26:00 +01:00
	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
2021-06-15 09:29:05 -07:00
	to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
2020-09-23 11:37:29 -07:00
	for ( ; to_zap; --to_zap) {
2022-10-19 16:56:12 +00:00
		if (list_empty(&kvm->arch.possible_nx_huge_pages))
2020-09-23 11:37:29 -07:00
			break;
2019-11-04 20:26:00 +01:00
		/*
		 * We use a separate list instead of just using active_mmu_pages
2022-10-19 16:56:12 +00:00
		 * because the number of shadow pages that can be replaced with an
		 * NX huge page is expected to be relatively small compared to
		 * the total number of shadow pages. And because the TDP MMU
		 * doesn't use active_mmu_pages.
2019-11-04 20:26:00 +01:00
		 */
2022-10-19 16:56:12 +00:00
		sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
2019-11-04 20:26:00 +01:00
				      struct kvm_mmu_page,
2022-10-19 16:56:12 +00:00
				      possible_nx_huge_page_link);
		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
2022-11-03 13:44:21 -07:00
		WARN_ON_ONCE(!sp->role.direct);
		/*
		 * Unaccount and do not attempt to recover any NX Huge Pages
		 * that are being dirty tracked, as they would just be faulted
		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
		 * recovered, along with all the other huge pages in the slot,
		 * when dirty logging is disabled.
2022-11-17 12:25:02 -05:00
		 *
		 * Since gfn_to_memslot() is relatively expensive, it helps to
		 * skip it if the test cannot possibly return true. On the
		 * other hand, if any memslot has logging enabled, chances are
		 * good that all of them do, in which case unaccount_nx_huge_page()
		 * is much cheaper than zapping the page.
		 *
		 * If a memslot update is in progress, reading an incorrect value
		 * of kvm->nr_memslots_dirty_logging is not a problem: if it is
		 * becoming zero, gfn_to_memslot() will be done unnecessarily; if
		 * it is becoming nonzero, the page will be zapped unnecessarily.
		 * Either way, this only affects efficiency in racy situations,
		 * and not correctness.
2022-11-03 13:44:21 -07:00
		 */
2022-11-17 12:25:02 -05:00
		slot = NULL;
		if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker
Factor in the address space (non-SMM vs. SMM) of the target shadow page
when recovering potential NX huge pages, otherwise KVM will retrieve the
wrong memslot when zapping shadow pages that were created for SMM. The
bug most visibly manifests as a WARN on the memslot being non-NULL, but
the worst case scenario is that KVM could unaccount the shadow page
without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
is being dirty logged, but the SMM slot is not.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
Call Trace:
<TASK>
__pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
kvm_vm_worker_thread+0x106/0x1c0 [kvm]
kthread+0xd9/0x100
ret_from_fork+0x2c/0x50
</TASK>
---[ end trace 0000000000000000 ]---
This bug was exposed by commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU
roots until they are explicitly invalidated"), which allowed KVM to retain
SMM TDP MMU roots effectively indefinitely. Before commit edbdb43fc96b,
KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
once all vCPUs exited SMM, which made the window where this bug (recovering
an SMM NX huge page) could be encountered quite tiny. To hit the bug, the
NX recovery thread would have to run while at least one vCPU was in SMM.
Most VMs typically only use SMM during boot, and so the problematic shadow
pages were gone by the time the NX recovery thread ran.
Now that KVM preserves TDP MMU roots until they are explicitly invalidated
(e.g. by a memslot deletion), the window to trigger the bug is effectively
never closed because most VMMs don't delete memslots after boot (except
for a handful of special scenarios).
Fixes: eb298605705a ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
Reported-by: Fabio Coatti <fabio.coatti@gmail.com>
Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 18:01:37 -07:00
			struct kvm_memslots *slots;
			slots = kvm_memslots_for_spte_role(kvm, sp->role);
			slot = __gfn_to_memslot(slots, sp->gfn);
2022-11-17 12:25:02 -05:00
			WARN_ON_ONCE(!slot);
		}
2022-11-03 13:44:21 -07:00
		if (slot && kvm_slot_dirty_track_enabled(slot))
			unaccount_nx_huge_page(kvm, sp);
		else if (is_tdp_mmu_page(sp))
2021-04-06 11:08:51 -04:00
			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
2022-10-19 16:56:18 +00:00
		else
2020-10-14 11:27:00 -07:00
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
2022-10-19 16:56:18 +00:00
		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
2019-11-04 20:26:00 +01:00
2021-02-02 10:57:24 -08:00
		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
2021-03-25 13:01:18 -07:00
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
2022-02-26 00:15:37 +00:00
			rcu_read_unlock();
2021-02-02 10:57:24 -08:00
			cond_resched_rwlock_write(&kvm->mmu_lock);
2021-03-25 13:01:18 -07:00
			flush = false;
2022-02-26 00:15:37 +00:00
			rcu_read_lock();
2019-11-04 20:26:00 +01:00
		}
	}
2021-03-25 13:01:18 -07:00
	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
2019-11-04 20:26:00 +01:00
2022-02-26 00:15:37 +00:00
	rcu_read_unlock();
2021-02-02 10:57:24 -08:00
	write_unlock(&kvm->mmu_lock);
2019-11-04 20:26:00 +01:00
	srcu_read_unlock(&kvm->srcu, rcu_idx);
}
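The per-pass budget computed by the recovery function (to_zap is the split count divided by the ratio, rounded up) can be explored with a small simulation. This is a sketch under simplifying assumptions that do not hold in general: every budgeted page is actually zapped, none is re-split between passes, and no slot is being dirty logged. The toy_ names are hypothetical.

```c
#include <assert.h>

#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* One recovery pass: zap ceil(splits / ratio) pages, or none if ratio is 0. */
static unsigned long toy_recover_once(unsigned long splits, unsigned int ratio)
{
	unsigned long to_zap = ratio ? TOY_DIV_ROUND_UP(splits, ratio) : 0;

	/* The budget never exceeds what is left. */
	assert(to_zap <= splits);
	return splits - to_zap;
}

/* Number of split pages left after `passes` consecutive recovery passes. */
static unsigned long toy_recover(unsigned long splits, unsigned int ratio,
				 unsigned int passes)
{
	while (passes--)
		splits = toy_recover_once(splits, ratio);
	return splits;
}
```

Starting from 1000 split pages with a ratio of 60, the first pass zaps 17 pages. After 60 passes (one hour at one pass per minute in this model) somewhat fewer than half of the original pages remain, which is roughly the decay that the "halving time" wording in the period comment describes.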
2022-10-19 16:56:12 +00:00
static long get_nx_huge_page_recovery_timeout(u64 start_time)
2019-11-04 20:26:00 +01:00
{
2021-11-20 01:57:06 +00:00
	bool enabled;
	uint period;
2021-10-19 18:06:27 -07:00
2021-11-20 01:57:06 +00:00
	enabled = calc_nx_huge_pages_recovery_period(&period);
2021-10-19 18:06:27 -07:00
2021-11-20 01:57:06 +00:00
	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
		       : MAX_SCHEDULE_TIMEOUT;
2019-11-04 20:26:00 +01:00
}
2022-10-19 16:56:12 +00:00
static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
2019-11-04 20:26:00 +01:00
{
	u64 start_time;
	long remaining_time;
	while (true) {
		start_time = get_jiffies_64();
2022-10-19 16:56:12 +00:00
		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
2019-11-04 20:26:00 +01:00
		set_current_state(TASK_INTERRUPTIBLE);
		while (!kthread_should_stop() && remaining_time > 0) {
			schedule_timeout(remaining_time);
2022-10-19 16:56:12 +00:00
			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
2019-11-04 20:26:00 +01:00
			set_current_state(TASK_INTERRUPTIBLE);
		}
		set_current_state(TASK_RUNNING);
		if (kthread_should_stop())
			return 0;
2022-10-19 16:56:12 +00:00
		kvm_recover_nx_huge_pages(kvm);
2019-11-04 20:26:00 +01:00
	}
}
int kvm_mmu_post_init_vm(struct kvm *kvm)
{
	int err;
2023-06-01 17:58:59 -07:00
	if (nx_hugepage_mitigation_hard_disabled)
		return 0;
2022-10-19 16:56:12 +00:00
	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
2019-11-04 20:26:00 +01:00
					  "kvm-nx-lpage-recovery",
2022-10-19 16:56:12 +00:00
					  &kvm->arch.nx_huge_page_recovery_thread);
2019-11-04 20:26:00 +01:00
	if (!err)
2022-10-19 16:56:12 +00:00
		kthread_unpark(kvm->arch.nx_huge_page_recovery_thread);
2019-11-04 20:26:00 +01:00
	return err;
}
void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
{
2022-10-19 16:56:12 +00:00
	if (kvm->arch.nx_huge_page_recovery_thread)
		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
2019-11-04 20:26:00 +01:00
}
2023-10-27 11:22:01 -07:00
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-27 11:22:02 -07:00
bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
					struct kvm_gfn_range *range)
{
	/*
	 * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
	 * can simply ignore such slots. But if userspace is making memory
	 * PRIVATE, then KVM must prevent the guest from accessing the memory
	 * as shared. And if userspace is making memory SHARED and this point
	 * is reached, then at least one page within the range was previously
	 * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
	 * a hugepage can be used for affected ranges.
	 */
	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
		return false;
	return kvm_unmap_gfn_range(kvm, range);
}
2023-10-27 11:22:01 -07:00
static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
				int level)
{
	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG;
}
static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
				 int level)
{
	lpage_info_slot(gfn, slot, level)->disallow_lpage &= ~KVM_LPAGE_MIXED_FLAG;
}
static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
			       int level)
{
	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
}
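The three helpers above keep a flag bit in the same word as a disallow count. A standalone sketch of that layout is shown below. It assumes the flag occupies the top bit of a 32-bit word (the value of TOY_LPAGE_MIXED_FLAG is an assumption, not taken from this file) and that a huge page is allowed only when the whole word is zero. The toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: top bit is the "mixed attributes" flag, low bits a count. */
#define TOY_LPAGE_MIXED_FLAG (1u << 31)

static unsigned int toy_set_mixed(unsigned int v)
{
	return v | TOY_LPAGE_MIXED_FLAG;
}

static unsigned int toy_clear_mixed(unsigned int v)
{
	return v & ~TOY_LPAGE_MIXED_FLAG;
}

static bool toy_test_mixed(unsigned int v)
{
	return (v & TOY_LPAGE_MIXED_FLAG) != 0;
}

/* Other users bump the count in the low bits; it must not spill into the flag. */
static unsigned int toy_count_inc(unsigned int v)
{
	assert((v & ~TOY_LPAGE_MIXED_FLAG) != TOY_LPAGE_MIXED_FLAG - 1);
	return v + 1;
}

/* A huge page is usable only if neither the count nor the flag is set. */
static bool toy_hugepage_allowed(unsigned int v)
{
	return v == 0;
}
```

Setting the flag on an otherwise clean entry blocks the huge page, and clearing it restores the previous count untouched, so an entry that was already disallowed for another reason stays disallowed.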
static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
			       gfn_t gfn, int level, unsigned long attrs)
{
	const unsigned long start = gfn;
	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
	if (level == PG_LEVEL_2M)
		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
		if (hugepage_test_mixed(slot, gfn, level - 1) ||
		    attrs != kvm_get_memory_attributes(kvm, gfn))
			return false;
	}
	return true;
}
bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
					 struct kvm_gfn_range *range)
{
	unsigned long attrs = range->arg.attributes;
	struct kvm_memory_slot *slot = range->slot;
	int level;
	lockdep_assert_held_write(&kvm->mmu_lock);
	lockdep_assert_held(&kvm->slots_lock);
	/*
	 * Calculate which ranges can be mapped with hugepages even if the slot
	 * can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over
	 * a range that has PRIVATE GFNs, and conversely converting a range to
	 * SHARED may now allow hugepages.
	 */
	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
		return false;
	/*
	 * The sequence matters here: upper levels consume the result of lower
	 * level's scanning.
	 */
	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
		gfn_t gfn = gfn_round_for_level(range->start, level);
		/* Process the head page if it straddles the range. */
		if (gfn != range->start || gfn + nr_pages > range->end) {
			/*
			 * Skip mixed tracking if the aligned gfn isn't covered
			 * by the memslot, KVM can't use a hugepage due to the
			 * misaligned address regardless of memory attributes.
			 */
KVM: x86/mmu: x86: Don't overflow lpage_info when checking attributes
Fix KVM_SET_MEMORY_ATTRIBUTES to not overflow lpage_info array and trigger
KASAN splat, as seen in the private_mem_conversions_test selftest.
When memory attributes are set on a GFN range, that range will have
specific properties applied to the TDP. A huge page cannot be used when
the attributes are inconsistent, so they are disabled for those
specific huge pages. For internal KVM reasons, huge pages are also not
allowed to span adjacent memslots regardless of whether the backing memory
could be mapped as huge.
What GFNs support which huge page sizes is tracked by an array of arrays
'lpage_info' on the memslot, of ‘kvm_lpage_info’ structs. Each index of
lpage_info contains a vmalloc allocated array of these for a specific
supported page size. The kvm_lpage_info denotes whether a specific huge
page (GFN and page size) on the memslot is supported. These arrays include
indices for unaligned head and tail huge pages.
Preventing huge pages from spanning adjacent memslots is covered by
incrementing the count in head and tail kvm_lpage_info when the memslot is
allocated, but disallowing huge pages for memory that has mixed attributes
has to be done in a more complicated way. During the
KVM_SET_MEMORY_ATTRIBUTES ioctl KVM updates lpage_info for each memslot in
the range that has mismatched attributes. KVM does this a memslot at a
time, and marks a special bit, KVM_LPAGE_MIXED_FLAG, in the kvm_lpage_info
for any huge page. This bit is essentially a permanently elevated count.
So huge pages will not be mapped for the GFN at that page size if the
count is elevated in either case: a huge head or tail page unaligned to
the memslot or if KVM_LPAGE_MIXED_FLAG is set because it has mixed
attributes.
To determine whether a huge page has consistent attributes, the
KVM_SET_MEMORY_ATTRIBUTES operation checks an xarray to make sure it
consistently has the incoming attribute. Since level - 1 huge pages are
aligned to level huge pages, it employs an optimization. As long as the
level - 1 huge pages are checked first, it can just check these and assume
that if each level - 1 huge page contained within the level sized huge
page is not mixed, then the level size huge page is not mixed. This
optimization happens in the helper hugepage_has_attrs().
Unfortunately, although the kvm_lpage_info array representing page size
'level' will contain an entry for an unaligned tail page of size level,
the array for level - 1 will not contain an entry for each GFN at page
size level. The level - 1 array will only contain an index for any
unaligned region covered by level - 1 huge page size, which can be a
smaller region. So this causes the optimization to overflow the level - 1
kvm_lpage_info and perform a vmalloc out of bounds read.
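The size mismatch can be checked with a little arithmetic. The sketch
below, using hypothetical helper names and assuming the x86 factor of 512
between levels, walks the level - 1 children of a level-sized page and
reports whether every index stays inside the level - 1 array. It also shows
the containment test that rules the bad case out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Levels: 1 = 4K, 2 = 2M, 3 = 1G; each level covers 2^9 of the previous. */
unsigned int hpage_gfn_shift(int level)
{
	assert(level >= 1);
	return (unsigned int)(level - 1) * 9;
}

uint64_t pages_per_hpage(int level)
{
	return UINT64_C(1) << hpage_gfn_shift(level);
}

/* Entries in the per-level array, including unaligned head and tail. */
uint64_t slot_lpages(uint64_t base_gfn, uint64_t npages, int level)
{
	unsigned int shift = hpage_gfn_shift(level);

	assert(npages > 0);
	return ((base_gfn + npages - 1) >> shift) - (base_gfn >> shift) + 1;
}

/* True if every level - 1 child index falls inside the level - 1 array. */
bool child_walk_in_bounds(uint64_t base_gfn, uint64_t npages,
			  uint64_t gfn, int level)
{
	unsigned int shift;
	uint64_t entries, end, child;

	assert(level >= 2);
	shift = hpage_gfn_shift(level - 1);
	entries = slot_lpages(base_gfn, npages, level - 1);
	end = gfn + pages_per_hpage(level);

	for (child = gfn; child < end; child += pages_per_hpage(level - 1)) {
		/* Unsigned wraparound makes indices below the slot too large. */
		uint64_t idx = (child >> shift) - (base_gfn >> shift);

		if (idx >= entries)
			return false;
	}
	return true;
}

/* True if the level-sized page at gfn lies entirely inside the memslot. */
bool slot_contains_hpage(uint64_t base_gfn, uint64_t npages,
			 uint64_t gfn, int level)
{
	return gfn >= base_gfn &&
	       gfn + pages_per_hpage(level) <= base_gfn + npages;
}
```

With an 8MB memslot at GFN 0 (2048 pages), the 2M array has 4 entries while
a walk of the single 1G entry visits 512 children, so the walk runs off the
end. A page that passes the containment test never has this problem.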
In some cases of head and tail pages where an overflow could happen,
callers skip the operation completely, as KVM_LPAGE_MIXED_FLAG is not
required to prevent huge pages, as discussed earlier. But for memslots that
are smaller than the 1GB page size, the callers do call
hugepage_has_attrs(). In this case the huge page is both the head and the
tail page. The issue can be observed simply by compiling the kernel with
CONFIG_KASAN_VMALLOC and running the selftest
"private_mem_conversions_test", which produces output like the following:
BUG: KASAN: vmalloc-out-of-bounds in hugepage_has_attrs+0x7e/0x110
Read of size 4 at addr ffffc900000a3008 by task private_mem_con/169
Call Trace:
dump_stack_lvl
print_report
? __virt_addr_valid
? hugepage_has_attrs
? hugepage_has_attrs
kasan_report
? hugepage_has_attrs
hugepage_has_attrs
kvm_arch_post_set_memory_attributes
kvm_vm_ioctl
It is a little ambiguous whether the unaligned head page (in the bug case
also the tail page) should be expected to have KVM_LPAGE_MIXED_FLAG set.
It is not functionally required, as the unaligned head/tail pages will
already have their kvm_lpage_info count incremented. The comments imply
not setting it on unaligned head pages is intentional, so fix the callers
to skip trying to set KVM_LPAGE_MIXED_FLAG in this case, and in doing so
not call hugepage_has_attrs().
Cc: stable@vger.kernel.org
Fixes: 90b4fe17981e ("KVM: x86: Disallow hugepages when memory attributes are mixed")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/r/20240314212902.2762507-1-rick.p.edgecombe@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-14 14:29:02 -07:00
			if (gfn >= slot->base_gfn &&
			    gfn + nr_pages <= slot->base_gfn + slot->npages) {
2023-10-27 11:22:01 -07:00
				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
					hugepage_clear_mixed(slot, gfn, level);
				else
					hugepage_set_mixed(slot, gfn, level);
			}
			gfn += nr_pages;
		}
		/*
		 * Pages entirely covered by the range are guaranteed to have
		 * only the attributes which were just set.
		 */
		for ( ; gfn + nr_pages <= range->end; gfn += nr_pages)
			hugepage_clear_mixed(slot, gfn, level);
		/*
		 * Process the last tail page if it straddles the range and is
		 * contained by the memslot.  Like the head page, KVM can't
		 * create a hugepage if the slot size is misaligned.
		 */
		if (gfn < range->end &&
		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
				hugepage_clear_mixed(slot, gfn, level);
			else
				hugepage_set_mixed(slot, gfn, level);
		}
	}
	return false;
}
void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
					    struct kvm_memory_slot *slot)
{
	int level;
	if (!kvm_arch_has_private_mem(kvm))
		return;
	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		/*
		 * Don't bother tracking mixed attributes for pages that can't
		 * be huge due to alignment, i.e. process only pages that are
		 * entirely contained by the memslot.
		 */
		gfn_t end = gfn_round_for_level(slot->base_gfn + slot->npages, level);
		gfn_t start = gfn_round_for_level(slot->base_gfn, level);
		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
		gfn_t gfn;
		if (start < slot->base_gfn)
			start += nr_pages;
		/*
		 * Unlike setting attributes, every potential hugepage needs to
		 * be manually checked as the attributes may already be mixed.
		 */
		for (gfn = start; gfn < end; gfn += nr_pages) {
			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
				hugepage_clear_mixed(slot, gfn, level);
			else
				hugepage_set_mixed(slot, gfn, level);
		}
	}
}
#endif