2236 Commits

Author SHA1 Message Date
Sean Christopherson
5ebbc470d7 KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create
Drop code to set CR0.ET for the guest during initialization of the guest
FPU.  The code was added as a misguided bug fix by commit 380102c8e431
("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue
where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM
systems.  While init_vmcb() did set CR0.ET, it only did so in the VMCB,
and subtly did not update vcpu->cr0.  Stuffing CR0.ET worked around the
immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB
being out of sync.  That underlying bug was eventually remedied by commit
18fa000ae453 ("KVM: SVM: Reset cr0 properly on vcpu reset").

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
ff8828c84f KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT
Do not blindly mark all registers as available+dirty at RESET/INIT, and
instead rely on writes to registers to go through the proper mutators or
to explicitly mark registers as dirty.  INIT in particular does not blindly
overwrite all registers, e.g. select bits in CR0 are preserved across INIT,
thus marking registers available+dirty without first reading the register
from hardware is incorrect.

In practice this is a benign bug as KVM doesn't let the guest control CR0
bits that are preserved across INIT, and all other true registers are
explicitly written during the RESET/INIT flows.  The PDPTRs and EX_INFO
"registers" are not explicitly written, but accessing those values during
RESET/INIT is nonsensical and would be a KVM bug regardless of register
caching.

Fixes: 66f7b72e1171 ("KVM: x86: Make register state after reset conform to specification")
[sean: !!! NOT FOR STABLE !!!]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
94c641ba7a KVM: x86: Simplify retrieving the page offset when loading PDTPRs
Replace impressively complex "logic" for computing the page offset from
CR3 when loading PDPTRs.  Unlike other paging modes, the address held in
CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits
11:5 need to be used as the offset from the gfn when reading PDPTRs.

The existing calculation originated in commit 1342d3536d6a ("[PATCH] KVM:
MMU: Load the pae pdptrs on cr3 change like the processor does"), which
read the PDPTRs from guest memory as individual 8-byte loads.  At the
time, the so called "offset" was the base index of PDPTR0 as a _u64_, not
a byte offset.  Naming aside, the computation was useful and arguably
simplified the overall flow.

Unfortunately, when commit 195aefde9cc2 ("KVM: Add general accessors to
read and write guest memory") added accessors with offsets at byte
granularity, the cleverness of the original code was lost and KVM was
left with convoluted code for a simple operation.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
15cabbc259 KVM: x86: Subsume nested GPA read helper into load_pdptrs()
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
incorrect use.  Reading guest memory straight from an L2 GPA is extremely
rare (as evidenced by the lack of users), as very few constructs in x86
specify physical addresses, even fewer are virtualized by KVM, and even
fewer yet require emulation of L2 by L0 KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Juergen Gross
a1c42ddedf kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the
number of allowed vcpu-ids. This has already led to confusion, so
rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more
clear

Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-3-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Vitaly Kuznetsov
620b2438ab KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask
kvm_make_vcpus_request_mask() already disables preemption so just like
kvm_make_all_cpus_request_except() it can be switched to using
pre-allocated per-cpu cpumasks. This allows for improvements for both
users of the function: in Hyper-V emulation code 'tlb_flush' can now be
dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
gets rid of dynamic allocation.

cpumask_available() checks in kvm_make_vcpu_request() and
kvm_kick_many_cpus() can now be dropped as they checks for an impossible
condition: kvm_init() makes sure per-cpu masks are allocated.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
381cecc5d7 KVM: Drop 'except' parameter from kvm_make_vcpus_request_mask()
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for
'except' parameter so it can just be dropped.

No functional change intended ©.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Fares Mehanna
e1fc1553cd kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a
consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.

We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.

Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 11:30:14 -04:00
Maxim Levitsky
37687c403a KVM: x86: reset pdptrs_from_userspace when exiting smm
When exiting SMM, pdpts are loaded again from the guest memory.

This fixes a theoretical bug, when exit from SMM triggers entry to the
nested guest which re-uses some of the migration
code which uses this flag as a workaround for a legacy userspace.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Sean Christopherson
94c245a245 KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry
Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU works,
but it's a rather odd and suboptimal method to check the index of a given
vCPU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:12 -04:00
Haimin Zhang
eb7511bf91 KVM: x86: Handle SRCU initialization failure during page track init
Check the return of init_srcu_struct(), which can fail due to OOM, when
initializing the page track mechanism.  Lack of checking leads to a NULL
pointer deref found by a modified syzkaller.

Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
[Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Sean Christopherson
03a6e84069 KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
RESET, this is a nop as vcpu is zero-allocated.  For INIT, the bug has
likely escaped notice because no firmware/kernel puts its page tables root
at PA=0, let alone relies on INIT to get the desired CR3 for such page
tables.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:08 -04:00
Sean Christopherson
7117003fe4 KVM: x86: Mark all registers as avail/dirty at vCPU creation
Mark all registers as available and dirty at vCPU creation, as the vCPU has
obviously not been loaded into hardware, let alone been given the chance to
be modified in hardware.  On SVM, reading from "uninitialized" hardware is
a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
hardware does not allow for arbitrary field encoding schemes.

On VMX, backing memory for VMCSes is also zero allocated, but true
initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
architectural specification technically allows CPU implementations to
encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
the inverted value of every field, which would result in VMREAD to a
zero-allocated field returns all ones.

In practice, only the AR_BYTES fields are known to be manipulated by
hardware during VMREAD/VMREAD; no known hardware or VMM (for nested VMX)
does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
other words, this is technically a bug fix, but practically speakings it's
a glorified nop.

Failure to mark registers as available has been a lurking bug for quite
some time.  The original register caching supported only GPRs (+RIP, which
is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
worked because the two cacheable registers, RIP and RSP, are generally
speaking not read as side effects in other flows.

Arguably, commit aff48baa34c0 ("KVM: Fetch guest cr3 from hardware on
demand") was the first instance of failure to mark regs available.  While
_just_ marking CR3 available during vCPU creation wouldn't have fixed the
VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
reading GUEST_CR3 when it's available would have avoided VMREAD to a
technically-uninitialized VMCS.

Fixes: aff48baa34c0 ("KVM: Fetch guest cr3 from hardware on demand")
Fixes: 6de4f3ada40b ("KVM: Cache pdptrs")
Fixes: 6de12732c42c ("KVM: VMX: Optimize vmx_get_rflags()")
Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
Fixes: bd31fe495d0d ("KVM: VMX: Add proper cache tracking for CR0")
Fixes: f98c1e77127d ("KVM: VMX: Add proper cache tracking for CR4")
Fixes: 5addc235199f ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
Fixes: 8791585837f6 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:07 -04:00
Zelin Deng
d9130a2dfd KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
When MSR_IA32_TSC_ADJUST is written by guest due to TSC ADJUST feature
especially there's a big tsc warp (like a new vCPU is hot-added into VM
which has been up for a long time), tsc_offset is added by a large value
then go back to guest. This causes system time jump as tsc_timestamp is
not adjusted in the meantime and pvclock monotonic character.
To fix this, just notify kvm to update vCPU's guest time before back to
guest.

Cc: stable@vger.kernel.org
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 07:07:03 -04:00
Maxim Levitsky
61e5f69ef0 KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
while running.

This change is mostly intended for more robust single stepping
of the guest and it has the following benefits when enabled:

* Resuming from a breakpoint is much more reliable.
  When resuming execution from a breakpoint, with interrupts enabled,
  more often than not, KVM would inject an interrupt and make the CPU
  jump immediately to the interrupt handler and eventually return to
  the breakpoint, to trigger it again.

  From the user point of view it looks like the CPU never executed a
  single instruction and in some cases that can even prevent forward
  progress, for example, when the breakpoint is placed by an automated
  script (e.g lx-symbols), which does something in response to the
  breakpoint and then continues the guest automatically.
  If the script execution takes enough time for another interrupt to
  arrive, the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't
  land the debugger into an interrupt handler.

* RFLAGS.TF has less chance to be leaked to the guest:

  We set that flag behind the guest's back to do single stepping
  but if single step lands us into an interrupt/exception handler
  it will be leaked to the guest in the form of being pushed
  to the stack.
  This doesn't completely eliminate this problem as exceptions
  can still happen, but at least this reduces the chances
  of this happening.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:37 -04:00
Mingwei Zhang
71f51d2c32 KVM: x86/mmu: Add detailed page size stats
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large page of 1GB (or larger) is adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.

So remove the lpages since it is easy for user space to aggregate the info.
Instead, provide a comprehensive page stats of all sizes from 4K to 512G.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Jing Zhang
f95937ccf5 KVM: stats: Support linear and logarithmic histogram statistics
Add new types of KVM stats, linear and logarithmic histogram.
Histogram are very useful for observing the value distribution
of time or size related stats.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:32 -04:00
Maxim Levitsky
06ef813466 KVM: SVM: avoid refreshing avic if its state didn't change
Since AVIC can be inhibited and uninhibited rapidly it is possible that
we have nothing to do by the time the svm_refresh_apicv_exec_ctrl
is called.

Detect and avoid this, which will be useful when we will start calling
avic_vcpu_load/avic_vcpu_put when the avic inhibition state changes.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:27 -04:00
Maxim Levitsky
b0a1637f64 KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
Currently on SVM, the kvm_request_apicv_update toggles the APICv
memslot without doing any synchronization.

If there is a mismatch between that memslot state and the AVIC state,
on one of the vCPUs, an APIC mmio access can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause MMIO fault, but rather it will just be
read/written from/to the dummy page mapped into the
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and carefully
order the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of the guest mode,
but don't yet see the new avic state.
Then only after (4) all other vCPUs can update their AVIC state and resume.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:23 -04:00
Maxim Levitsky
36222b117e KVM: x86: don't disable APICv memslot when inhibited
Thanks to the former patches, it is now possible to keep the APICv
memslot always enabled, and it will be invisible to the guest
when it is inhibited

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Peter Xu
4139b1972a KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer.  Start to use the latter one in
kvm_alloc_memslot_metadata().

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:04:51 -04:00
Sean Christopherson
ad0577c375 KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.

Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Paolo Bonzini
1ccb6f983a KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXIT
The commit efdab992813fb ("KVM: x86: fix escape of guest dr6 to the host")
fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out.

But writing to debug registers is slow, and it can be visible in perf results
sometimes, even if neither the host nor the guest activate breakpoints.

Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
where DR6 gets the guest value, and it never happens at all on SVM,
the register can be cleared in vmx.c right after reading it.

Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
375e28ffc0 KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT
Commit c77fb5fe6f03 ("KVM: x86: Allow the guest to run with dirty debug
registers") allows the guest accessing to DRs without exiting when
KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before the commit.

But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
but also when KVM_DEBUGREG_BP_ENABLED.  The second case is unnecessary
and just leads to a more case which leaks stale DR6 to the host which has
to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put().

Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
on VMX because SVM always uses the DR6 value from the VMCB.  So move this
line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.

Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Lai Jiangshan
34e9f86007 KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD
Commit ae561edeb421 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.

But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.

KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better
to be removed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
319afe6856 KVM: xen: do not use struct gfn_to_hva_cache
gfn_to_hva_cache is not thread-safe, so it is usually used only within
a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
implementation has such a cache in kvm->arch, but it is not really
used except to store the location of the shared info page.  Replace
shinfo_set and shinfo_cache with just the value that is passed via
KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
initialization value is not zero anymore and therefore kvm_xen_init_vm
needs to be introduced.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:31:40 -04:00
Hamza Mahfooz
269e9552d2 KVM: const-ify all relevant uses of struct kvm_memory_slot
As alluded to in commit f36f3f2846b5 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used, needs to be refactored to preserve the
"const"ness of struct kvm_memory_slot across-the-board.

Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:04:24 -04:00
Sean Christopherson
4c72ab5aa6 KVM: x86: Preserve guest's CR0.CD/NW on INIT
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
defined by both Intel's SDM and AMD's APM.

Note, current versions of Intel's SDM are very poorly written with
respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
States Following Power-up, Reset, or INIT" quite clearly lists power-up,
RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
then attempts to qualify CD/NW behavior in a footnote:

  2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
     are cleared.

Presumably that footnote is only meant for INIT, as the RESET case and
especially the power-up case are rather non-sensical.  Another footnote
all but confirms that:

  6. Internal caches are invalid after power-up and RESET, but left
     unchanged with an INIT.

Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
else can hack their BIOS to check RESET and power-up :-D).

Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-47-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:02:00 -04:00
Sean Christopherson
265e43530c KVM: SVM: Emulate #INIT in response to triple fault shutdown
Emulate a full #INIT instead of simply initializing the VMCB if the
guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
much of which is mirrored by the VMCB, results in incoherent and broken
vCPU state.

Ideally, KVM would not automatically init anything on shutdown, and
instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
userspace to explicitly INIT or RESET the vCPU.  Even better would be to
add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
(and SMI on Intel CPUs).

But, that ship has sailed, and emulating #INIT is the next best thing as
that has at least some connection with reality since there exist bare
metal platforms that automatically INIT the CPU if it hits shutdown.

Fixes: 46fe4ddd9dbb ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
f39e805ee1 KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
common x86.  VMX and SVM now have near-identical sequences, the only
difference being that VMX updates the exception bitmap.  Updating the
bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
behind in VMX due to the need to update exception intercepts after the
control registers are set.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:57 -04:00
Sean Christopherson
908b7d43c0 KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0
Skip the MMU permission_fault() check if paging is disabled when
verifying the cached MMIO GVA is usable.  The check is unnecessary and
can theoretically get a false positive since the MMU doesn't zero out
"permissions" or "pkru_mask" when guest paging is disabled.

The obvious alternative is to zero out all the bitmasks when configuring
nonpaging MMUs, but that's unnecessary work and doesn't align with the
MMU's general approach of doing as little as possible for flows that are
supposed to be unreachable.

This is nearly a nop as the false positive is nothing more than an
insignificant performance blip, and more or less limited to string MMIO
when L1 is running with paging disabled.  KVM doesn't cache MMIO if L2 is
active with nested TDP since the "GVA" is really an L2 GPA.  If L2 is
active without nested TDP, then paging can't be disabled as neither VMX
nor SVM allows entering the guest without paging of some form.

Jumping back to L1 with paging disabled, in that case direct_map is true
and so KVM will use CR2 as a GPA; the only time it doesn't is if the
fault from the emulator doesn't match or emulator_can_use_gpa(), and that
fails only on string MMIO and other instructions with multiple memory
operands.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:54 -04:00
Sean Christopherson
49d8665cc2 KVM: x86: Move EDX initialization at vCPU RESET to common code
Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.

No functional change intended.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:52 -04:00
Sean Christopherson
df37ed38e6 KVM: x86: Flush the guest's TLB on INIT
Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
that's correct as the APM also states that the TLB is flush on "External
initialization of the processor."  Regardless, relying on the guest to be
paranoid is unnecessarily risky, while an unnecessary flush is benign
from a functional perspective and likely has no measurable impact on
guest performance.

Note, as of the April 2021 version of Intels' SDM, it also contradicts
itself with respect to TLB flushing.  The overview of INIT explicitly
calls out the TLBs as being invalidated, while a table later in the same
section says they are unchanged.

  9.1 INITIALIZATION OVERVIEW:
    The major difference is that during an INIT, the internal caches, MSRs,
    MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
    are invalidated as with a hardware reset)

  Table 9-1:

  Register                    Power up    Reset      INIT
  Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged

Given Core2's erratum[*] about global TLB entries not being flush on INIT,
it's safe to assume that the table is simply wrong.

  AZ28. INIT Does Not Clear Global Entries in the TLB
  Problem: INIT may not flush a TLB entry when:
    • The processor is in protected mode with paging enabled and the page global enable
      flag is set (PGE bit of CR4 register)
    • G bit for the page table entry is set
    • TLB entry is present in TLB when INIT occurs
    • Software may encounter unexpected page fault or incorrect address translation due
      to a TLB entry erroneously left in TLB after INIT.

  Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
              bits PG or PE) registers before writing to memory early in BIOS
              code to clear all the global entries from TLB.

  Status: For the steppings affected, see the Summary Tables of Changes.

[*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf

Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:48 -04:00
Maxim Levitsky
df63202fe5 KVM: x86: APICv: drop immediate APICv disablement on current vCPU
Special case of disabling the APICv on the current vCPU right away in
kvm_request_apicv_update doesn't bring much benefit vs raising
KVM_REQ_APICV_UPDATE on it instead, since this request will be processed
on the next entry to the guest.
(the comment about having another #VMEXIT is wrong).

It also hides various assumptions that APIVc enable state matches
the APICv inhibit state, as this special case only makes those states
match on the current vCPU.

Previous patches fixed few such assumptions so now it should be safe
to drop this special case.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:48 -04:00
Peter Xu
ec1cf69c37 KVM: X86: Add per-vm stat for max rmap list size
Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
for the vm.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:37 -04:00
Sean Christopherson
e489a4a6bd KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()
Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run().  This
allows a future patch to allow synchronizing select state for protected
VMs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:37 -04:00
Sean Christopherson
673692735f KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:36 -04:00
Paolo Bonzini
fa7a549d32 KVM: x86: accept userspace interrupt only if no event is injected
Once an exception has been injected, any side effects related to
the exception (such as setting CR2 or DR6) have been taked place.
Therefore, once KVM sets the VM-entry interruption information
field or the AMD EVENTINJ field, the next VM-entry must deliver that
exception.

Pending interrupts are processed after injected exceptions, so
in theory it would not be a problem to use KVM_INTERRUPT when
an injected exception is present.  However, DOSEMU is using
run->ready_for_interrupt_injection to detect interrupt windows
and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
interrupt manually.  For this to work, the interrupt window
must be delayed after the completion of the previous event
injection.

Cc: stable@vger.kernel.org
Reported-by: Stas Sergeev <stsp2@yandex.ru>
Tested-by: Stas Sergeev <stsp2@yandex.ru>
Fixes: 71cc849b7093 ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-30 07:53:02 -04:00
Vitaly Kuznetsov
0a31df6823 KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access
MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault
interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is
stated in Documentation/virt/kvm/msr.rst.

Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210722123018.260035-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-26 08:26:53 -04:00
Linus Torvalds
405386b021 * Allow again loading KVM on 32-bit non-PAE builds
* Fixes for host SMIs on AMD
 
 * Fixes for guest SMIs on AMD
 
 * Fixes for selftests on s390 and ARM
 
 * Fix memory leak
 
 * Enforce no-instrumentation area on vmentry when hardware
   breakpoints are in use.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDwRi4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOt4AgAl6xEkMwDC74d/QFIOA7s2GD3ugfa
 z5XqGN1qz/nmEMnuIg6/tjTXDPmn/dfLMqy8RGZfyUv6xbgPcv/7JuFMRILvwGTb
 SbOVrGnR/QOhMdlfWH34qDkXeEsthTXSgQgVm/iiED0TttvQYVcZ/E9mgzaWQXor
 T1yTug2uAUXJ1EBxY0ZBo2kbh+BvvdmhEF0pksZOuwqZdH3zn3QCXwAwkL/OtUYE
 M6nNn3j1LU38C4OK1niXOZZVOuMIdk/l7LyFpjUQTFlIqitQAPtBE5MD+K+A9oC2
 Yocxyj2tId1e6o8bLic/oN8/LpdORTvA/wDMj5M1DcMzvxQuQIpGYkcVGg==
 =gjVA
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Allow again loading KVM on 32-bit non-PAE builds

 - Fixes for host SMIs on AMD

 - Fixes for guest SMIs on AMD

 - Fixes for selftests on s390 and ARM

 - Fix memory leak

 - Enforce no-instrumentation area on vmentry when hardware breakpoints
   are in use.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: selftests: smm_test: Test SMM enter from L2
  KVM: nSVM: Restore nested control upon leaving SMM
  KVM: nSVM: Fix L1 state corruption upon return from SMM
  KVM: nSVM: Introduce svm_copy_vmrun_state()
  KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
  KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
  KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
  KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
  KVM: SVM: add module param to control the #SMI interception
  KVM: SVM: remove INIT intercept handler
  KVM: SVM: #SMI interception must not skip the instruction
  KVM: VMX: Remove vmx_msr_index from vmx.h
  KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
  KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
  kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
  KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
  KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
  KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
  KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
  KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
  ...
2021-07-15 11:56:07 -07:00
Lai Jiangshan
f85d401606 KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run().  Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.

This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,

Another issue is consistency.  When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled.  The host tracing tools may see different results depending
on what the guest is doing.

To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-15 10:19:41 -04:00
Sean Christopherson
f0414b078d Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled"
Let KVM load if EFER.NX=0 even if NX is supported, the analysis and
testing (or lack thereof) for the non-PAE host case was garbage.

If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
skips over the entire EFER sequence.  Hopefully that can be changed in
the future to allow KVM to require EFER.NX, but the motivation behind
KVM's requirement isn't yet merged.  Reverting and revisiting the mess
at a later date is by far the safest approach.

This reverts commit 8bbed95d2cb6e5de8a342d761a89b0a04faed7be.

Fixes: 8bbed95d2cb6 ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210625001853.318148-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-14 12:17:55 -04:00
Linus Torvalds
1423e2660c Fixes and improvements for FPU handling on x86:
- Prevent sigaltstack out of bounds writes. The kernel unconditionally
     writes the FPU state to the alternate stack without checking whether
     the stack is large enough to accomodate it.
 
     Check the alternate stack size before doing so and in case it's too
     small force a SIGSEGV instead of silently corrupting user space data.
 
   - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been
     updated despite the fact that the FPU state which is stored on the
     signal stack has grown over time which causes trouble in the field
     when AVX512 is available on a CPU. The kernel does not expose the
     minimum requirements for the alternate stack size depending on the
     available and enabled CPU features.
 
     ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
     Add it to x86 as well
 
   - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE
     related issues unearthed quite some inconsistencies, duplicated code
     and other issues.
 
     The fine granular overhaul addresses this, makes the code more robust
     and maintainable, which allows to integrate upcoming XSTATE related
     features in sane ways.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDlcpETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeP5D/4i+AgYYeiMLgGb+NS7iaKPfoWo6LIz
 y3qdTSA0DQaIYbYivWwRO/g0GYdDMXDWeZalFi7eGnVI8O3eOog+22Zrf/y0UINB
 KJHdYd4ApWHhs401022y5hexrWQvnV8w1yQCuj/zLm6eC+AVhdwt2AY+IBoRrdUj
 wqY97B/4rJNsBvvqTDn9EeDrJA2y0y0Suc7AhIp2BGMI+dpIdxys8RJDamXNWyDL
 gJf0YRgUoiIn3AHKb+fgv60AoxfC175NSg/5/y/scFNXqVlW0Up4YCb7pqG9o2Ga
 f3XvtWfbw1N5PmUYjFkALwEkzGUbM3v0RA3xLY2j2WlWm9fBPPy59dt+i/h/VKyA
 GrA7i7lcIqX8dfVH6XkrReZBkRDSB6t9SZTvV54jAz5fcIZO2Rg++UFUvI/R6GKK
 XCcxukYaArwo+IG62iqDszS3gfLGhcor/cviOeULRC5zMUIO4Jah+IhDnifmShtC
 M5s9QzrwIRD/XMewGRQmvkiN4kBfE7jFoBQr1J9leCXJKrM+2JQmMzVInuubTQIq
 SdlKOaAIn7xtekz+6XdFG9Gmhck0PCLMJMOLNvQkKWI3KqGLRZ+dAWKK0vsCizAx
 0BA7ZeB9w9lFT+D8mQCX77JvW9+VNwyfwIOLIrJRHk3VqVpS5qvoiFTLGJJBdZx4
 /TbbRZu7nXDN2w==
 =Mq1m
 -----END PGP SIGNATURE-----

Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:
 "Fixes and improvements for FPU handling on x86:

   - Prevent sigaltstack out of bounds writes.

     The kernel unconditionally writes the FPU state to the alternate
     stack without checking whether the stack is large enough to
     accomodate it.

     Check the alternate stack size before doing so and in case it's too
     small force a SIGSEGV instead of silently corrupting user space
     data.

   - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never
     been updated despite the fact that the FPU state which is stored on
     the signal stack has grown over time which causes trouble in the
     field when AVX512 is available on a CPU. The kernel does not expose
     the minimum requirements for the alternate stack size depending on
     the available and enabled CPU features.

     ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
     Add it to x86 as well.

   - A major cleanup of the x86 FPU code. The recent discoveries of
     XSTATE related issues unearthed quite some inconsistencies,
     duplicated code and other issues.

     The fine granular overhaul addresses this, makes the code more
     robust and maintainable, which allows to integrate upcoming XSTATE
     related features in sane ways"

* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
  x86/fpu/signal: Let xrstor handle the features to init
  x86/fpu/signal: Handle #PF in the direct restore path
  x86/fpu: Return proper error codes from user access functions
  x86/fpu/signal: Split out the direct restore code
  x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
  x86/fpu/signal: Sanitize the xstate check on sigframe
  x86/fpu/signal: Remove the legacy alignment check
  x86/fpu/signal: Move initial checks into fpu__restore_sig()
  x86/fpu: Mark init_fpstate __ro_after_init
  x86/pkru: Remove xstate fiddling from write_pkru()
  x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
  x86/fpu: Remove PKRU handling from switch_fpu_finish()
  x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
  x86/fpu: Hook up PKRU into ptrace()
  x86/fpu: Add PKRU storage outside of task XSAVE buffer
  x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
  x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
  x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
  x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
  ...
2021-07-07 11:12:01 -07:00
Aaron Lewis
19238e75bd kvm: x86: Allow userspace to handle emulation errors
Add a fallback mechanism to the in-kernel instruction emulator that
allows userspace the opportunity to process an instruction the emulator
was unable to.  When the in-kernel instruction emulator fails to process
an instruction it will either inject a #UD into the guest or exit to
userspace with exit reason KVM_INTERNAL_ERROR.  This is because it does
not know how to proceed in an appropriate manner.  This feature lets
userspace get involved to see if it can figure out a better path
forward.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:48 -04:00
Sean Christopherson
20f632bd00 KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper
Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
as the CR0/CR4 update masks must exactly match the mmu_role bits, with
one exception (see below).  The "full" CR0/CR4 will be used by future
commits to initialize the MMU and its role, as opposed to the current
approach of pulling everything from vCPU, which is incorrect for certain
flows, e.g. nested NPT.

CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
needs to be in the mmu_role, but technically doesn't need to be checked
by kvm_post_set_cr4().  However, the extra check is completely benign as
the hardware restrictions simply mean LA57 will never be _the_ cause of
a MMU reset during MOV CR4.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:39 -04:00
Sean Christopherson
dbc4739b6b KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER
When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
as a u64 in various flows (mostly MMU).  Passing the params as u32s is
functionally ok since all of the affected registers reserve bits 63:32 to
zero (enforced by KVM), but it's technically wrong.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:38 -04:00
Sean Christopherson
63f5a1909f KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.

KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held).  And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.

To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).

Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.

In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:36 -04:00
Sean Christopherson
0aa1837533 KVM: x86: Properly reset MMU context at vCPU RESET/INIT
Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
was set prior to INIT.  Simply re-initializing the current MMU is not
sufficient as the current root HPA may not be usable in the new context.
E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
KVM will fail to switch to the 32-bit pae_root and bomb on the next
VM-Enter due to running with a 64-bit CR3 in 32-bit mode.

This bug was papered over in both VMX and SVM, but still managed to rear
its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
generate the wrong MMU role.

In VMX, the INIT issue is specific to running without unrestricted guest
since unrestricted guest is available if and only if EPT is enabled.
Commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real
mode") resolved the issue by forcing a reset when entering emulated real
mode.

In SVM, commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset") forced
a MMU reset on every INIT to workaround the flaw in common x86.  Note, at
the time the bug was fixed, the SVM problem was exacerbated by a complete
lack of a CR4 update.

The vendor resets will be reverted in future patches, primarily to aid
bisection in case there are non-INIT flows that rely on the existing VMX
logic.

Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
context reset.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:35 -04:00
Jing Zhang
bc9e9e672d KVM: debugfs: Reuse binary stats descriptors
To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:29 -04:00
Jing Zhang
ce55c04945 KVM: stats: Support binary stats retrieval for a VCPU
Add a VCPU ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VCPU stats header,
descriptors and data.
Define VCPU statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:19 -04:00