linux

iv/linux

Author	SHA1	Message	Date
Sean Christopherson	26b516bb39	x86/hyperv: KVM: Rename "hv_enlightenments" to "hv_vmcb_enlightenments" Now that KVM isn't littered with "struct hv_enlightenments" casts, rename the struct to "hv_vmcb_enlightenments" to highlight the fact that the struct is specifically for SVM's VMCB. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-18 12:58:59 -05:00
Sean Christopherson	68ae7c7bc5	KVM: SVM: Add a proper field for Hyper-V VMCB enlightenments Add a union to provide hv_enlightenments side-by-side with the sw_reserved bytes that Hyper-V's enlightenments overlay. Casting sw_reserved everywhere is messy, confusing, and unnecessarily unsafe. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-18 12:58:58 -05:00
Sean Christopherson	089fe572a2	x86/hyperv: Move VMCB enlightenment definitions to hyperv-tlfs.h Move Hyper-V's VMCB enlightenment definitions to the TLFS header; the definitions come directly from the TLFS[], not from KVM. No functional change intended. [] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_svm_enlightened_vmcb_fields [vitaly: rename VMCB_HV_ -> HV_VMCB_ to match the rest of hyperv-tlfs.h, keep svm/hyperv.h] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-18 12:58:57 -05:00
Paolo Bonzini	6c7b2202e4	KVM: x86: avoid memslot check in NX hugepage recovery if it cannot succeed Since gfn_to_memslot() is relatively expensive, it helps to skip it if it the memslot cannot possibly have dirty logging enabled. In order to do this, add to struct kvm a counter of the number of log-page memslots. While the correct value can only be read with slots_lock taken, the NX recovery thread is content with using an approximate value. Therefore, the counter is an atomic_t. Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/ by David Matlack. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-18 11:30:12 -05:00
Paolo Bonzini	771a579c6e	Merge branch 'kvm-svm-harden' into HEAD This fixes three issues in nested SVM: 1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset(). However, if running nested and L1 doesn't intercept shutdown, the function resets vcpu->arch.hflags without properly leaving the nested state. This leaves the vCPU in inconsistent state and later triggers a kernel panic in SVM code. The same bug can likely be triggered by sending INIT via local apic to a vCPU which runs a nested guest. On VMX we are lucky that the issue can't happen because VMX always intercepts triple faults, thus triple fault in L2 will always be redirected to L1. Plus, handle_triple_fault() doesn't reset the vCPU. INIT IPI can't happen on VMX either because INIT events are masked while in VMX mode. Secondarily, KVM doesn't honour SHUTDOWN intercept bit of L1 on SVM. A normal hypervisor should always intercept SHUTDOWN, a unit test on the other hand might want to not do so. Finally, the guest can trigger a kernel non rate limited printk on SVM from the guest, which is fixed as well. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:51:09 -05:00
Maxim Levitsky	05311ce954	KVM: x86: remove exit_int_info warning in svm_handle_exit It is valid to receive external interrupt and have broken IDT entry, which will lead to #GP with exit_int_into that will contain the index of the IDT entry (e.g any value). Other exceptions can happen as well, like #NP or #SS (if stack switch fails). Thus this warning can be user triggred and has very little value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-10-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:41:10 -05:00
Maxim Levitsky	92e7d5c83a	KVM: x86: allow L1 to not intercept triple fault This is SVM correctness fix - although a sane L1 would intercept SHUTDOWN event, it doesn't have to, so we have to honour this. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:39:59 -05:00
Maxim Levitsky	ed129ec905	KVM: x86: forcibly leave nested mode on vCPU reset While not obivous, kvm_vcpu_reset() leaves the nested mode by clearing 'vcpu->arch.hflags' but it does so without all the required housekeeping. On SVM, it is possible to have a vCPU reset while in guest mode because unlike VMX, on SVM, INIT's are not latched in SVM non root mode and in addition to that L1 doesn't have to intercept triple fault, which should also trigger L1's reset if happens in L2 while L1 didn't intercept it. If one of the above conditions happen, KVM will continue to use vmcb02 while not having in the guest mode. Later the IA32_EFER will be cleared which will lead to freeing of the nested guest state which will (correctly) free the vmcb02, but since KVM still uses it (incorrectly) this will lead to a use after free and kernel crash. This issue is assigned CVE-2022-3344 Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:39:57 -05:00
Maxim Levitsky	f9697df251	KVM: x86: add kvm_leave_nested add kvm_leave_nested which wraps a call to nested_ops->leave_nested into a function. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:39:56 -05:00
Maxim Levitsky	16ae56d7e0	KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use Make sure that KVM uses vmcb01 before freeing nested state, and warn if that is not the case. This is a minimal fix for CVE-2022-3344 making the kernel print a warning instead of a kernel panic. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:39:54 -05:00
Maxim Levitsky	917401f26a	KVM: x86: nSVM: leave nested mode on vCPU free If the VM was terminated while nested, we free the nested state while the vCPU still is in nested mode. Soon a warning will be added for this condition. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:39:53 -05:00
David Matlack	eb29860570	KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked, as it will just be faulted back in at the same 4KiB granularity when accessed by a vCPU. This may need to be changed if KVM ever supports 2MiB (or larger) dirty tracking granularity, or faulting huge pages during dirty tracking for reads/executes. However for now, these zaps are entirely wasteful. In order to check if this commit increases the CPU usage of the NX recovery worker thread I used a modified version of execute_perf_test [1] that supports splitting guest memory into multiple slots and reports /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just before tearing down the VM. The goal was to force a large number of NX Huge Page recoveries and see if the recovery worker used any more CPU. Test Setup: echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio Test Command: ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o \| kvm-nx-lpage-re:se.sum_exec_runtime \| \| ---------------------------------------- \| Run \| Before \| After \| ------- \| ------------------ \| ------------------- \| 1 \| 730.084105 \| 724.375314 \| 2 \| 728.751339 \| 740.581988 \| 3 \| 736.264720 \| 757.078163 \| Comparing the median results, this commit results in about a 1% increase CPU usage of the NX recovery worker when testing a VM with 16 slots. However, the effect is negligible with the default halving time of NX pages, which is 1 hour rather than 10 seconds given by period_ms = 1000, ratio = 10. [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/ Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221103204421.1146958-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:26:35 -05:00
Paolo Bonzini	63d28a25e0	KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map only fails in the exact same conditions that the earlier loop tested in order to issue a "break". So, instead of checking twice the condition (upper level SPTEs could not be created or was frozen), just exit the loop with a goto---the usual poor-man C replacement for RAII early returns. While at it, do not use the "ret" variable for return values of functions that do not return a RET_PF_* enum. This is clearer and also makes it possible to initialize ret to RET_PF_RETRY. Suggested-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:10:25 -05:00
David Matlack	c4b33d28ea	KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault Now that the TDP MMU has a mechanism to split huge pages, use it in the fault path when a huge page needs to be replaced with a mapping at a lower level. This change reduces the negative performance impact of NX HugePages. Prior to this change if a vCPU executed from a huge page and NX HugePages was enabled, the vCPU would take a fault, zap the huge page, and mapping the faulting address at 4KiB with execute permissions enabled. The rest of the memory would be left unmapped and have to be faulted back in by the guest upon access (read, write, or execute). If guest is backed by 1GiB, a single execute instruction can zap an entire GiB of its physical address space. For example, it can take a VM longer to execute from its memory than to populate that memory in the first place: $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96 Populating memory : 2.748378795s Executing from memory : 2.899670885s With this change, such faults split the huge page instead of zapping it, which avoids the non-present faults on the rest of the huge page: $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96 Populating memory : 2.729544474s Executing from memory : 0.111965688s <--- This change also reduces the performance impact of dirty logging when eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can be desirable for read-heavy workloads, as it avoids allocating memory to split huge pages that are never written and avoids increasing the TLB miss cost on reads of those pages. \| Config: ept=Y, tdp_mmu=Y, 5% writes \| \| Iteration 1 dirty memory time \| \| --------------------------------------------- \| vCPU Count \| eps=N (Before) \| eps=N (After) \| eps=Y \| ------------ \| -------------- \| ------------- \| ------------ \| 2 \| 0.332305091s \| 0.019615027s \| 0.006108211s \| 4 \| 0.353096020s \| 0.019452131s \| 0.006214670s \| 8 \| 0.453938562s \| 0.019748246s \| 0.006610997s \| 16 \| 0.719095024s \| 0.019972171s \| 0.007757889s \| 32 \| 1.698727124s \| 0.021361615s \| 0.012274432s \| 64 \| 2.630673582s \| 0.031122014s \| 0.016994683s \| 96 \| 3.016535213s \| 0.062608739s \| 0.044760838s \| Eager page splitting remains beneficial for write-heavy workloads, but the gap is now reduced. \| Config: ept=Y, tdp_mmu=Y, 100% writes \| \| Iteration 1 dirty memory time \| \| --------------------------------------------- \| vCPU Count \| eps=N (Before) \| eps=N (After) \| eps=Y \| ------------ \| -------------- \| ------------- \| ------------ \| 2 \| 0.317710329s \| 0.296204596s \| 0.058689782s \| 4 \| 0.337102375s \| 0.299841017s \| 0.060343076s \| 8 \| 0.386025681s \| 0.297274460s \| 0.060399702s \| 16 \| 0.791462524s \| 0.298942578s \| 0.062508699s \| 32 \| 1.719646014s \| 0.313101996s \| 0.075984855s \| 64 \| 2.527973150s \| 0.455779206s \| 0.079789363s \| 96 \| 2.681123208s \| 0.673778787s \| 0.165386739s \| Further study is needed to determine if the remaining gap is acceptable for customer workloads or if eager_page_split=N still requires a-priori knowledge of the VM workload, especially when considering these costs extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221109185905.486172-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 10:52:48 -05:00
David Matlack	d6ecfe976a	KVM: x86/mmu: Use BIT{,_ULL}() for PFERR masks Use the preferred BIT() and BIT_ULL() to construct the PFERR masks rather than open-coding the bit shifting. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221102184654.282799-6-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2022-11-16 16:59:00 -08:00
Paolo Bonzini	d663b8a285	KVM: replace direct irq.h inclusion virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:37 -05:00
Like Xu	de0f619564	KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter Defer reprogramming counters and handling overflow via KVM_REQ_PMU when incrementing counters. KVM skips emulated WRMSR in the VM-Exit fastpath, the fastpath runs with IRQs disabled, skipping instructions can increment and reprogram counters, reprogramming counters can sleep, and sleeping is disallowed while IRQs are disabled. [] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 [] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM [] preempt_count: 1, expected: 0 [] RCU nest depth: 0, expected: 0 [] INFO: lockdep is turned off. [] irq event stamp: 0 [] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0 [] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0 [] softirqs last disabled at (0): [<0000000000000000>] 0x0 [] Preemption disabled at: [] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm] [] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2 [] Call Trace: [] <TASK> [] dump_stack_lvl+0x6c/0x9b [] __might_resched.cold+0x22e/0x297 [] __mutex_lock+0xc0/0x23b0 [] perf_event_ctx_lock_nested+0x18f/0x340 [] perf_event_pause+0x1a/0x110 [] reprogram_counter+0x2af/0x1490 [kvm] [] kvm_pmu_trigger_event+0x429/0x950 [kvm] [] kvm_skip_emulated_instruction+0x48/0x90 [kvm] [] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm] [] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel] [] vcpu_enter_guest+0x1d22/0x3dc0 [kvm] Add a field to kvm_pmc to track the previous counter value in order to defer overflow detection to kvm_pmu_handle_event() (the counter must be paused before handling overflow, and that may increment the counter). Opportunistically shrink sizeof(struct kvm_pmc) a bit. Suggested-by: Wanpeng Li <wanpengli@tencent.com> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:36 -05:00
Like Xu	68fb4757e8	KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming a counter multiple times during a single VM-Exit. Deferring programming will also allow KVM to fix a bug where immediately reprogramming a counter can result in sleeping (taking a mutex) while interrupts are disabled in the VM-Exit fastpath. Introduce kvm_pmu_request_counter_reprogam() to make it obvious that KVM is _requesting_ a reprogram and not actually doing the reprogram. Opportunistically refine related comments to avoid misunderstandings. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:36 -05:00
Sean Christopherson	dcbb816a28	KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed When reprogramming a counter, clear the counter's "reprogram pending" bit if the counter is disabled (by the guest) or is disallowed (by the userspace filter). In both cases, there's no need to re-attempt programming on the next coincident KVM_REQ_PMU as enabling the counter by either method will trigger reprogramming. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:35 -05:00
Sean Christopherson	f1c5651fda	KVM: x86/pmu: Force reprogramming of all counters on PMU filter change Force vCPUs to reprogram all counters on a PMU filter change to provide a sane ABI for userspace. Use the existing KVM_REQ_PMU to do the programming, and take advantage of the fact that the reprogram_pmi bitmap fits in a u64 to set all bits in a single atomic update. Note, setting the bitmap and making the request needs to be done _after_ the SRCU synchronization to ensure that vCPUs will reprogram using the new filter. KVM's current "lazy" approach is confusing and non-deterministic. It's confusing because, from a developer perspective, the code is buggy as it makes zero sense to let userspace modify the filter but then not actually enforce the new filter. The lazy approach is non-deterministic because KVM enforces the filter whenever a counter is reprogrammed, not just on guest WRMSRs, i.e. a guest might gain/lose access to an event at random times depending on what is going on in the host. Note, the resulting behavior is still non-determinstic while the filter is in flux. If userspace wants to guarantee deterministic behavior, all vCPUs should be paused during the filter update. Jim Mattson <jmattson@google.com> Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Cc: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:35 -05:00
Sean Christopherson	3a05675722	KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed NX huge page regardless of the MMU type. Recovery runs while holding mmu_lock for write and so it should be impossible to get false positives on the WARN. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:34 -05:00
Mingwei Zhang	76901e56fb	KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Explicitly check if a NX huge page is disallowed when determining if a page fault needs to be forced to use a smaller sized page. KVM currently assumes that the NX huge page mitigation is the only scenario where KVM will force a shadow page instead of a huge page, and so unnecessarily keeps an existing shadow page instead of replacing it with a huge page. Any scenario that causes KVM to zap leaf SPTEs may result in having a SP that can be made huge without violating the NX huge page mitigation. E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging"), KVM would keep shadow pages after disabling dirty logging due to a live migration being canceled, resulting in degraded performance due to running with 4kb pages instead of huge pages. Although the dirty logging case is "fixed", that fix is coincidental, i.e. is an implementation detail, and there are other scenarios where KVM will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page migration (mmu_notifier invalidation) to create a huge page would yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume a huge page is disallowed. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: use spte_to_child_sp(), massage changelog, fold into if-statement] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:34 -05:00
Sean Christopherson	5e3edd7e8b	KVM: x86/mmu: Add helper to convert SPTE value to its shadow page Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:33 -05:00
Sean Christopherson	d25ceb9264	KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Track the number of TDP MMU "shadow" pages instead of tracking the pages themselves. With the NX huge page list manipulation moved out of the common linking flow, elminating the list-based tracking means the happy path of adding a shadow page doesn't need to acquire a spinlock and can instead inc/dec an atomic. Keep the tracking as the WARN during TDP MMU teardown on leaked shadow pages is very, very useful for detecting KVM bugs. Tracking the number of pages will also make it trivial to expose the counter to userspace as a stat in the future, which may or may not be desirable. Note, the TDP MMU needs to use a separate counter (and stat if that ever comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the TDP MMU consumes so few pages relative to shadow paging), and including TDP MMU pages in that counter would break both the shrinker and shadow MMUs, e.g. if a VM is using nested TDP. Cc: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:33 -05:00
Sean Christopherson	61f9447854	KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:32 -05:00
Sean Christopherson	b5b0977f4a	KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Account and track NX huge pages for nonpaging MMUs so that a future enhancement to precisely check if a shadow page can't be replaced by a NX huge page doesn't get false positives. Without correct tracking, KVM can get stuck in a loop if an instruction is fetching and writing data on the same huge page, e.g. KVM installs a small executable page on the fetch fault, replaces it with an NX huge page on the write fault, and faults again on the fetch. Alternatively, and perhaps ideally, KVM would simply not enforce the workaround for nonpaging MMUs. The guest has no page tables to abuse and KVM is guaranteed to switch to a different MMU on CR0.PG being toggled so there's no security or performance concerns. However, getting make_spte() to play nice now and in the future is unnecessarily complex. In the current code base, make_spte() can enforce the mitigation if TDP is enabled or the MMU is indirect, but make_spte() may not always have a vCPU/MMU to work with, e.g. if KVM were to support in-line huge page promotion when disabling dirty logging. Without a vCPU/MMU, KVM could either pass in the correct information and/or derive it from the shadow page, but the former is ugly and the latter subtly non-trivial due to the possibility of direct shadow pages in indirect MMUs. Given that using shadow paging with an unpaged guest is far from top priority _and_ has been subjected to the workaround since its inception, keep it simple and just fix the accounting glitch. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:32 -05:00
Sean Christopherson	55c510e26a	KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Sean Christopherson	428e921611	KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Aaron Lewis	8aff460f21	KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_range Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the struct kvm_msr_filter_range. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-5-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:30 -05:00
Aaron Lewis	c1340fe359	KVM: x86: Add a VALID_MASK for the flag in kvm_msr_filter Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct kvm_msr_filter. This makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-4-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:29 -05:00
Aaron Lewis	db205f7e1e	KVM: x86: Add a VALID_MASK for the MSR exit reason flags Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason flags. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-3-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:29 -05:00
Aaron Lewis	be83794210	KVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernel Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW. Its value is 0, and using it incorrectly could have unintended consequences. E.g. prevent someone in the kernel from writing something like this. if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW) <allow the MSR> and getting confused when it doesn't work. It would be more ideal to remove this flag altogether, but userspace may already be using it, so protecting the kernel is all that can reasonably be done at this point. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921151525.904162-2-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:28 -05:00
Peter Xu	766576874b	kvm: x86: Allow to respond to generic signals during slow PF Enable x86 slow page faults to be able to respond to non-fatal signals, returning -EINTR properly when it happens. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195947.557281-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:28 -05:00
Peter Xu	c8b88b332b	kvm: Add interruptible flag to __gfn_to_pfn_memslot() Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling an userfaultfd page fault. No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:27 -05:00
Maxim Levitsky	fb28875fd7	KVM: x86: smm: preserve interrupt shadow in SMRAM When #SMI is asserted, the CPU can be in interrupt shadow due to sti or mov ss. It is not mandatory in Intel/AMD prm to have the #SMI blocked during the shadow, and on top of that, since neither SVM nor VMX has true support for SMI window, waiting for one instruction would mean single stepping the guest. Instead, allow #SMI in this case, but both reset the interrupt window and stash its value in SMRAM to restore it on exit from SMM. This fixes rare failures seen mostly on windows guests on VMX, when #SMI falls on the sti instruction which mainfest in VM entry failure due to EFLAGS.IF not being set, but STI interrupt window still being set in the VMCS. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-24-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:26 -05:00
Maxim Levitsky	95504c7c98	KVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode capable When the guest CPUID doesn't have support for long mode, 32 bit SMRAM layout is used and it has no support for preserving EFER and/or SVM state. Note that this isn't relevant to running 32 bit guests on VM which is long mode capable - such VM can still run 32 bit guests in compatibility mode. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-23-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:25 -05:00
Maxim Levitsky	dd5045fed5	KVM: x86: SVM: use smram structs Use SMM structs in the SVM code as well, which removes the last user of put_smstate/GET_SMSTATE so remove these macros as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-22-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:25 -05:00
Maxim Levitsky	e6a82199b6	KVM: svm: drop explicit return value of kvm_vcpu_map if kvm_vcpu_map returns non zero value, error path should be triggered regardless of the exact returned error value. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-21-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:24 -05:00
Maxim Levitsky	8bcda1dee9	KVM: x86: smm: use smram struct for 64 bit smram load/restore Use kvm_smram_state_64 struct to save/restore the 64 bit SMM state (used when X86_FEATURE_LM is present in the guest CPUID, regardless of 32-bitness of the guest). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-20-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:24 -05:00
Maxim Levitsky	f34bdf4c17	KVM: x86: smm: use smram struct for 32 bit smram load/restore Use kvm_smram_state_32 struct to save/restore 32 bit SMM state (used when X86_FEATURE_LM is not present in the guest CPUID). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-19-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:23 -05:00
Maxim Levitsky	58c1d206d5	KVM: x86: smm: use smram structs in the common code Use kvm_smram union instad of raw arrays in the common smm code. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-18-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:23 -05:00
Maxim Levitsky	09779c16e3	KVM: x86: smm: add structs for KVM's smram layout Add structs that will be used to define and read/write the KVM's SMRAM layout, instead of reading/writing to raw offsets. Also document the differences between KVM's SMRAM layout and SMRAM layout that is used by real Intel/AMD cpus. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-17-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:22 -05:00
Maxim Levitsky	89dccf82e9	KVM: x86: smm: check for failures on smm entry In the rare case of the failure on SMM entry, the KVM should at least terminate the VM instead of going south. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-16-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:22 -05:00
Paolo Bonzini	a7662aa5e5	KVM: x86: do not define SMM-related constants if SMM disabled The hidden processor flags HF_SMM_MASK and HF_SMM_INSIDE_NMI_MASK are not needed if CONFIG_KVM_SMM is turned off. Remove the definitions altogether and the code that uses them. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:21 -05:00
Paolo Bonzini	85672346a7	KVM: zero output of KVM_GET_VCPU_EVENTS before filling in the struct This allows making some fields optional, as will be the case soon for SMM-related data. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:21 -05:00
Paolo Bonzini	cf7316d036	KVM: x86: do not define KVM_REQ_SMI if SMM disabled This ensures that all the relevant code is compiled out, in fact the process_smi stub can be removed too. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:20 -05:00
Paolo Bonzini	ba97bb07e0	KVM: x86: remove SMRAM address space if SMM is not supported If CONFIG_KVM_SMM is not defined HF_SMM_MASK will always be zero, and we can spare userspace the hassle of setting up the SMRAM address space simply by reporting that only one address space is supported. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:20 -05:00
Paolo Bonzini	31e83e21cf	KVM: x86: compile out vendor-specific code if SMM is disabled Vendor-specific code that deals with SMI injection and saving/restoring SMM state is not needed if CONFIG_KVM_SMM is disabled, so remove the four callbacks smi_allowed, enter_smm, leave_smm and enable_smi_window. The users in svm/nested.c and x86.c also have to be compiled out; the amount of #ifdef'ed code is small and it's not worth moving it to smm.c. enter_smm is now used only within #ifdef CONFIG_KVM_SMM, and the stub can therefore be removed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:19 -05:00
Paolo Bonzini	4b8e1b3201	KVM: allow compiling out SMM support Some users of KVM implement the UEFI variable store through a paravirtual device that does not require the "SMM lockbox" component of edk2; allow them to compile out system management mode, which is not a full implementation especially in how it interacts with nested virtualization. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:19 -05:00
Paolo Bonzini	1d0da94cda	KVM: x86: do not go through ctxt->ops when emulating rsm Now that RSM is implemented in a single emulator callback, there is no point in going through other callbacks for the sake of modifying processor state. Just invoke KVM's own internal functions directly, and remove the callbacks that were only used by em_rsm; the only substantial difference is in the handling of the segment registers and descriptor cache, which have to be parsed into a struct kvm_segment instead of a struct desc_struct. This also fixes a bug where emulator_set_segment was shifting the limit left by 12 if the G bit is set, but the limit had not been shifted right upon entry to SMM. The emulator context is still used to restore EIP and the general purpose registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:18 -05:00

1 2 3 4 5 ...

42601 Commits