0a7b73559b
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR virtualization adds no value, negatively impacts guest performance, and is a maintenance burden due to its complexity and oddities.

KVM's approach to virtualizing MTRRs makes no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadow paging doesn't magically account for guest MTRRs, nor does NPT.

After unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of issues that were misdiagnosed when EPT + VT-d with passthrough devices was being enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases).

The GFN-zapping logic in the MTRR virtualization code came from:

  commit efdfe536d8
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update

    Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused

    2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed

    This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR

which was a fix for:

  commit 0bed3b568b
  Author: Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit: Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200

    KVM: Improve MTRR structure

    As well as reset mmu context when set MTRR.

which was part of a "MTRR/PAT support for EPT" series that also added:

+	if (mt_mask) {
+		mt_mask = get_memory_type(vcpu, gfn) <<
+			  kvm_x86_ops->get_mt_mask_shift();
+		spte |= mt_mask;
+	}

where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given gfn. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT:

	kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
			      VMX_EPT_WRITABLE_MASK |
			      VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
			      VMX_EPT_IGMT_BIT);

which came in via:

  commit 928d4bf747
  Author: Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit: Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200

    KVM: VMX: Set IGMT bit in EPT entry

    There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute.

    The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates! The AuthorDates strongly suggest Sheng Yang added the whole "ignore guest PAT" (IGMT) thing as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix.

Jumping back to 0bed3b568b ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee
  Author: Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit: Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT

    Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT).

In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO.

  * Sheng Yang
  : Do you mean host(qemu) would access this memory and if we set it to guest
  : MTRR, host access would be broken? We would cover this in our shadow MTRR
  : patch, for we encountered this in video ram when doing some experiment with
  : VGA assignment.

And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT.

  * Avi Kivity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer,
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned
  : > memory region memory type follow guest's?
  : >
  :
  : With ept/npt, we can't, since the memory type is in the guest's
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the case below, everything since then has been a result of those two intertwined changes.

The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs:

  commit 606decd670
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"

    This reverts commit fd717f1101.
    It was reported to cause Machine Check Exceptions (bug 104091).

    ...

  commit fd717f1101
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages

    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected.

    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times.

Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, no one appears to have ever found a smoking gun, i.e. there is no definitive root cause for why emulating guest MTRRs via NPT PAT caused extremely slow boot times.

  commit fc07e76ac7
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"

    This reverts commit 3c2e7f7de3.
    Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices.

    ...

  commit 3c2e7f7de3
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes

    Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes.

    However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables.

    The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types.

In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctness.

Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
133 lines
2.8 KiB
C
// SPDX-License-Identifier: GPL-2.0-only
/*
 * vMTRR implementation
 *
 * Copyright (C) 2006 Qumranet, Inc.
 * Copyright 2010 Red Hat, Inc. and/or its affiliates.
 * Copyright(C) 2015 Intel Corporation.
 *
 * Authors:
 *   Yaniv Kamay <yaniv@qumranet.com>
 *   Avi Kivity <avi@qumranet.com>
 *   Marcelo Tosatti <mtosatti@redhat.com>
 *   Paolo Bonzini <pbonzini@redhat.com>
 *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
 */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/kvm_host.h>
#include <asm/mtrr.h>

#include "cpuid.h"

static u64 *find_mtrr(struct kvm_vcpu *vcpu, unsigned int msr)
{
	int index;

	switch (msr) {
	case MTRRphysBase_MSR(0) ... MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1):
		index = msr - MTRRphysBase_MSR(0);
		return &vcpu->arch.mtrr_state.var[index];
	case MSR_MTRRfix64K_00000:
		return &vcpu->arch.mtrr_state.fixed_64k;
	case MSR_MTRRfix16K_80000:
	case MSR_MTRRfix16K_A0000:
		index = msr - MSR_MTRRfix16K_80000;
		return &vcpu->arch.mtrr_state.fixed_16k[index];
	case MSR_MTRRfix4K_C0000:
	case MSR_MTRRfix4K_C8000:
	case MSR_MTRRfix4K_D0000:
	case MSR_MTRRfix4K_D8000:
	case MSR_MTRRfix4K_E0000:
	case MSR_MTRRfix4K_E8000:
	case MSR_MTRRfix4K_F0000:
	case MSR_MTRRfix4K_F8000:
		index = msr - MSR_MTRRfix4K_C0000;
		return &vcpu->arch.mtrr_state.fixed_4k[index];
	case MSR_MTRRdefType:
		return &vcpu->arch.mtrr_state.deftype;
	default:
		break;
	}
	return NULL;
}

static bool valid_mtrr_type(unsigned t)
{
	return t < 8 && (1 << t) & 0x73; /* 0, 1, 4, 5, 6 */
}

static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
	int i;
	u64 mask;

	if (msr == MSR_MTRRdefType) {
		if (data & ~0xcff)
			return false;
		return valid_mtrr_type(data & 0xff);
	} else if (msr >= MSR_MTRRfix64K_00000 && msr <= MSR_MTRRfix4K_F8000) {
		for (i = 0; i < 8 ; i++)
			if (!valid_mtrr_type((data >> (i * 8)) & 0xff))
				return false;
		return true;
	}

	/* variable MTRRs */
	if (WARN_ON_ONCE(!(msr >= MTRRphysBase_MSR(0) &&
			   msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1))))
		return false;

	mask = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
	if ((msr & 1) == 0) {
		/* MTRR base */
		if (!valid_mtrr_type(data & 0xff))
			return false;
		mask |= 0xf00;
	} else {
		/* MTRR mask */
		mask |= 0x7ff;
	}

	return (data & mask) == 0;
}

int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
	u64 *mtrr;

	mtrr = find_mtrr(vcpu, msr);
	if (!mtrr)
		return 1;

	if (!kvm_mtrr_valid(vcpu, msr, data))
		return 1;

	*mtrr = data;
	return 0;
}

int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
{
	u64 *mtrr;

	/* MSR_MTRRcap is a readonly MSR. */
	if (msr == MSR_MTRRcap) {
		/*
		 * SMRR = 0
		 * WC = 1
		 * FIX = 1
		 * VCNT = KVM_NR_VAR_MTRR
		 */
		*pdata = 0x500 | KVM_NR_VAR_MTRR;
		return 0;
	}

	mtrr = find_mtrr(vcpu, msr);
	if (!mtrr)
		return 1;

	*pdata = *mtrr;
	return 0;
}