701 Commits

Author SHA1 Message Date
Sean Christopherson
ea145aacf4 Revert "KVM: MMU: fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1], now that all users of the fast invalidate
mechanism are gone.

This reverts commit 5304b8d37c2a5ebca48330f5e7868d240eafbed1.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:47 +01:00
Sean Christopherson
52d5dedc79 Revert "KVM: MMU: reclaim the zapped-obsolete page first"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].

This reverts commit 365c886860c4ba670d245e762b23987c912c129a.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:42 +01:00
Sean Christopherson
4771450c34 Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
Revert back to a dedicated (and slower) mechanism for handling the
scenario where all MMIO shadow PTEs need to be zapped due to overflowing
the MMIO generation number.  The MMIO generation scenario is almost
literally a one-in-a-million occurrence, i.e. is not a performance
sensitive scenario.

Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
the fast invalidate mechanism altogether.

This reverts commit a8eca9dcc656a405a28ffba43f3d86a1ff0eb331.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:40 +01:00
Sean Christopherson
a592a3b8fc Revert "KVM: MMU: document fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1].

Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. as part of accessing reading PCI ROM, caused KVM to
flush all shadow entries[1].  Now that the memslot case flushes only
shadow entries belonging to the memslot, i.e. doesn't use the fast
invalidate mechanism, the only remaining usage of the mechanism are
when the VM is being destroyed and when the MMIO generation rolls
over.

When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.

The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.

Given that lock-breaking is not desirable (VM teardown) or irrelevant
(MMIO generation overflow), remove the fast invalidate mechanism to
simplify the code (a small amount) and to discourage future code from
zapping all pages as using such a big hammer should be a last restort.

This reverts commit f6f8adeef542a18b1cb26a0b772c9781a10bb477.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:39 +01:00
Sean Christopherson
152482580a KVM: Call kvm_arch_memslots_updated() before updating memslots
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:32 +01:00
Sean Christopherson
95c7b77d6e KVM: x86: Explicitly #define the VCPU_REGS_* indices
Declaring the VCPU_REGS_* as enums allows for more robust C code, but it
prevents using the values in assembly files.  Expliciting #define the
indices in an asm-friendly file to prepare for VMX moving its transition
code to a proper assembly file, but keep the enums for general usage.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:47:38 +01:00
Sean Christopherson
e814349950 KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
____kvm_handle_fault_on_reboot() provides a generic exception fixup
handler that is used to cleanly handle faults on VMX/SVM instructions
during reboot (or at least try to).  If there isn't a reboot in
progress, ____kvm_handle_fault_on_reboot() treats any exception as
fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
a BUG() to get a stack trace and die.

When it was originally added by commit 4ecac3fd6dc2 ("KVM: Handle
virtualization instruction #UD faults during reboot"), the "call" to
kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
is the RIP of the faulting instructing.

The PUSH+JMP trickery is necessary because the exception fixup handler
code lies outside of its associated function, e.g. right after the
function.  An actual CALL from the .fixup code would show a slightly
bogus stack trace, e.g. an extra "random" function would be inserted
into the trace, as the return RIP on the stack would point to no known
function (and the unwinder will likely try to guess who owns the RIP).

Unfortunately, the JMP was replaced with a CALL when the macro was
reworked to not spin indefinitely during reboot (commit b7c4145ba2eb
"KVM: Don't spin on virt instruction faults during reboot").  This
causes the aforementioned behavior where a bogus function is inserted
into the stack trace, e.g. my builds like to blame free_kvm_area().

Revert the CALL back to a JMP.  The changelog for commit b7c4145ba2eb
("KVM: Don't spin on virt instruction faults during reboot") contains
nothing that indicates the switch to CALL was deliberate.  This is
backed up by the fact that the PUSH <insn RIP> was left intact.

Note that an alternative to the PUSH+JMP magic would be to JMP back
to the "real" code and CALL from there, but that would require adding
a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
and would add no value, i.e. the stack trace would be the same.

Using CALL:

------------[ cut here ]------------
kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
invalid opcode: 0000 [#1] SMP
CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
FS:  00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
Call Trace:
 free_kvm_area+0x1044/0x43ea [kvm_intel]
 ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
 ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
 ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
 ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
 ? __set_task_blocked+0x38/0x90
 ? __set_current_blocked+0x50/0x60
 ? __fpu__restore_sig+0x97/0x490
 ? do_vfs_ioctl+0xa1/0x620
 ? __x64_sys_futex+0x89/0x180
 ? ksys_ioctl+0x66/0x70
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x4f/0x100
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
---[ end trace 9775b14b123b1713 ]---

Using JMP:

------------[ cut here ]------------
kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
invalid opcode: 0000 [#1] SMP
CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
FS:  00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
Call Trace:
 vmx_vcpu_run+0x156/0x630 [kvm_intel]
 ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
 ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
 ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
 ? __set_task_blocked+0x38/0x90
 ? __set_current_blocked+0x50/0x60
 ? __fpu__restore_sig+0x97/0x490
 ? do_vfs_ioctl+0xa1/0x620
 ? __x64_sys_futex+0x89/0x180
 ? ksys_ioctl+0x66/0x70
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x4f/0x100
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
---[ end trace f9daedb85ab3ddba ]---

Fixes: b7c4145ba2eb ("KVM: Don't spin on virt instruction faults during reboot")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:48:23 +01:00
Lan Tianyu
748c0e312f KVM: Make kvm_set_spte_hva() return int
The patch is to make kvm_set_spte_hva() return int and caller can
check return value to determine flush tlb or not.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:41 +01:00
Lan Tianyu
a49b96352e KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
Add flush range call back in the kvm_x86_ops and platform can use it
to register its associated function. The parameter "kvm_tlb_range"
accepts a single range and flush list which contains a list of ranges.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:38 +01:00
Chao Peng
86f5201df0 KVM: x86: Add Intel Processor Trace cpuid emulation
Expose Intel Processor Trace to guest only when
the PT works in Host-Guest mode.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:35 +01:00
Marc Orr
b666a4b697 kvm: x86: Dynamically allocate guest_fpu
Previously, the guest_fpu field was embedded in the kvm_vcpu_arch
struct. Unfortunately, the field is quite large, (e.g., 4352 bytes on my
current setup). This bloats the kvm_vcpu_arch struct for x86 into an
order 3 memory allocation, which can become a problem on overcommitted
machines. Thus, this patch moves the fpu state outside of the
kvm_vcpu_arch struct.

With this patch applied, the kvm_vcpu_arch struct is reduced to 15168
bytes for vmx on my setup when building the kernel with kvmconfig.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 18:00:08 +01:00
Marc Orr
240c35a378 kvm: x86: Use task structs fpu field for user
Previously, x86's instantiation of 'struct kvm_vcpu_arch' added an fpu
field to save/restore fpu-related architectural state, which will differ
from kvm's fpu state. However, this is redundant to the 'struct fpu'
field, called fpu, embedded in the task struct, via the thread field.
Thus, this patch removes the user_fpu field from the kvm_vcpu_arch
struct and replaces it with the task struct's fpu field.

This change is significant because the fpu struct is actually quite
large. For example, on the system used to develop this patch, this
change reduces the size of the vcpu_vmx struct from 23680 bytes down to
19520 bytes, when building the kernel with kvmconfig. This reduction in
the size of the vcpu_vmx struct moves us closer to being able to
allocate the struct at order 2, rather than order 3.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 18:00:07 +01:00
Vitaly Kuznetsov
6a058a1ead x86/kvm/hyper-v: use stimer config definition from hyperv-tlfs.h
As a preparation to implementing Direct Mode for Hyper-V synthetic
timers switch to using stimer config definition from hyperv-tlfs.h.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 17:59:57 +01:00
Vitaly Kuznetsov
e2e871ab2f x86/kvm/hyper-v: Introduce nested_get_evmcs_version() helper
The upcoming KVM_GET_SUPPORTED_HV_CPUID ioctl will need to return
Enlightened VMCS version in HYPERV_CPUID_NESTED_FEATURES.EAX when
it was enabled.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 17:59:54 +01:00
Leonid Shatz
326e742533 KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset
Since commit e79f245ddec1 ("X86/KVM: Properly update 'tsc_offset' to
represent the running guest"), vcpu->arch.tsc_offset meaning was
changed to always reflect the tsc_offset value set on active VMCS.
Regardless if vCPU is currently running L1 or L2.

However, above mentioned commit failed to also change
kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
This is because vmx_write_tsc_offset() could set the tsc_offset value
in active VMCS to given offset parameter *plus vmcs12->tsc_offset*.
However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
to given offset parameter. Without taking into account the possible
addition of vmcs12->tsc_offset. (Same is true for SVM case).

Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
actually set tsc_offset in active VMCS and modify
kvm_vcpu_write_tsc_offset() to set returned value in
vcpu->arch.tsc_offset.
In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
to make it clear that it is meant to set L1 TSC offset.

Fixes: e79f245ddec1 ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:10 +01:00
Jim Mattson
59073aaf6d kvm: x86: Add exception payload fields to kvm_vcpu_events
The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
later commit) adds the following fields to struct kvm_vcpu_events:
exception_has_payload, exception_payload, and exception.pending.

With this capability set, all of the details of vcpu->arch.exception,
including the payload for a pending exception, are reported to
userspace in response to KVM_GET_VCPU_EVENTS.

With this capability clear, the original ABI is preserved, and the
exception.injected field is set for either pending or injected
exceptions.

When userspace calls KVM_SET_VCPU_EVENTS with
KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:07:38 +02:00
Jim Mattson
c851436a34 kvm: x86: Add has_payload and payload to kvm_queued_exception
The payload associated with a #PF exception is the linear address of
the fault to be loaded into CR2 when the fault is delivered. The
payload associated with a #DB exception is a mask of the DR6 bits to
be set (or in the case of DR6.RTM, cleared) when the fault is
delivered. Add fields has_payload and payload to kvm_queued_exception
to track payloads for pending exceptions.

The new fields are introduced here, but for now, they are just cleared.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:22 +02:00
Vitaly Kuznetsov
57b119da35 KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability
Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.

Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.

Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:14 +02:00
Vitaly Kuznetsov
7dcd575520 x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed
MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be
avoided if the source data used to configure it didn't change; enhance
MMU extended role with the required fields and consolidate common code in
kvm_calc_mmu_role_common().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:06 +02:00
Vitaly Kuznetsov
a336282d77 x86/kvm/nVMX: introduce source data cache for kvm_init_shadow_ept_mmu()
MMU re-initialization is expensive, in particular,
update_permission_bitmask() and update_pkru_bitmask() are.

Cache the data used to setup shadow EPT MMU and avoid full re-init when
it is unchanged.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:06 +02:00
Vitaly Kuznetsov
36d9594dfb x86/kvm/mmu: make space for source data caching in struct kvm_mmu
In preparation to MMU reconfiguration avoidance we need a space to
cache source data. As this partially intersects with kvm_mmu_page_role,
create 64bit sized union kvm_mmu_role holding both base and extended data.
No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:05 +02:00
Paolo Bonzini
e173299101 x86/kvm/mmu: get rid of redundant kvm_mmu_setup()
Just inline the contents into the sole caller, kvm_init_mmu is now
public.

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:04 +02:00
Vitaly Kuznetsov
14c07ad89f x86/kvm/mmu: introduce guest_mmu
When EPT is used for nested guest we need to re-init MMU as shadow
EPT MMU (nested_ept_init_mmu_context() does that). When we return back
from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets
MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use
separate root caches; the improved hit rate is not very important for
single vCPU performance, but it avoids contention on the mmu_lock for
many vCPUs.

On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit
goes from 42k to 26k cycles.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:04 +02:00
Vitaly Kuznetsov
6a82cd1c7b x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()
Add an option to specify which MMU root we want to free. This will
be used when nested and non-nested MMUs for L1 are split.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:03 +02:00
Vitaly Kuznetsov
44dd3ffa7b x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu
a pointer to the currently used mmu. For now, this is always
vcpu->arch.root_mmu. No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:02 +02:00
Vitaly Kuznetsov
e6b6c483eb KVM: x86: hyperv: fix 'tlb_lush' typo
Regardless of whether your TLB is lush or not it still needs flushing.

Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:00 +02:00
Vitaly Kuznetsov
87ee613d07 KVM: x86: hyperv: keep track of mismatched VP indexes
In most common cases VP index of a vcpu matches its vcpu index. Userspace
is, however, free to set any mapping it wishes and we need to account for
that when we need to find a vCPU with a particular VP index. To keep search
algorithms optimal in both cases introduce 'num_mismatched_vp_indexes'
counter showing how many vCPUs with mismatching VP index we have. In case
the counter is zero we can assume vp_index == vcpu_idx.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:45 +02:00
Wei Yang
4fef0f4913 KVM: x86: move definition PT_MAX_HUGEPAGE_LEVEL and KVM_NR_PAGE_SIZES together
Currently, there are two definitions related to huge page, but a little bit
far from each other and seems loosely connected:

 * KVM_NR_PAGE_SIZES defines the number of different size a page could map
 * PT_MAX_HUGEPAGE_LEVEL means the maximum level of huge page

The number of different size a page could map equals the maximum level
of huge page, which is implied by current definition.

While current implementation may not be kind to readers and further
developers:

 * KVM_NR_PAGE_SIZES looks like a stand alone definition at first sight
 * in case we need to support more level, two places need to change

This patch tries to make these two definition more close, so that reader
and developer would feel more comfortable to manipulate.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:42 +02:00
Wei Yang
3ff519f29d KVM: x86: adjust kvm_mmu_page member to save 8 bytes
On a 64bits machine, struct is naturally aligned with 8 bytes. Since
kvm_mmu_page member *unsync* and *role* are less then 4 bytes, we can
rearrange the sequence to compace the struct.

As the comment shows, *role* and *gfn* are used to key the shadow page. In
order to keep the comment valid, this patch moves the *unsync* up and
exchange the position of *role* and *gfn*.

From /proc/slabinfo, it shows the size of kvm_mmu_page is 8 bytes less and
with one more object per slap after applying this patch.

    # name            <active_objs> <num_objs> <objsize> <objperslab>
    kvm_mmu_page_header      0           0       168         24

    kvm_mmu_page_header      0           0       160         25

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:40 +02:00
Jim Mattson
cfb634fe30 KVM: nVMX: Clear reserved bits of #DB exit qualification
According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
qualification field for debug exceptions are reserved (cleared to
0). However, the SDM is incorrect about bit 16 (corresponding to
DR6.RTM). This bit should be set if a debug exception (#DB) or a
breakpoint exception (#BP) occurred inside an RTM region while
advanced debugging of RTM transactional regions was enabled. Note that
this is the opposite of DR6.RTM, which "indicates (when clear) that a
debug exception (#DB) or breakpoint exception (#BP) occurred inside an
RTM region while advanced debugging of RTM transactional regions was
enabled."

There is still an issue with stale DR6 bits potentially being
misreported for the current debug exception.  DR6 should not have been
modified before vectoring the #DB exception, and the "new DR6 bits"
should be available somewhere, but it was and they aren't.

Fixes: b96fb439774e1 ("KVM: nVMX: fixes to nested virt interrupt injection")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:39 +02:00
Drew Schmitt
6fbbde9a19 KVM: x86: Control guest reads of MSR_PLATFORM_INFO
Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
to reads of MSR_PLATFORM_INFO.

Disabling access to reads of this MSR gives userspace the control to "expose"
this platform-dependent information to guests in a clear way. As it exists
today, guests that read this MSR would get unpopulated information if userspace
hadn't already set it (and prior to this patch series, only the CPUID faulting
information could have been populated). This existing interface could be
confusing if guests don't handle the potential for incorrect/incomplete
information gracefully (e.g. zero reported for base frequency).

Signed-off-by: Drew Schmitt <dasch@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20 00:51:46 +02:00
Liran Alon
e6c67d8cf1 KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv
In case L1 do not intercept L2 HLT or enter L2 in HLT activity-state,
it is possible for a vCPU to be blocked while it is in guest-mode.

According to Intel SDM 26.6.5 Interrupt-Window Exiting and
Virtual-Interrupt Delivery: "These events wake the logical processor
if it just entered the HLT state because of a VM entry".
Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
should be waken from the HLT state and injected with the interrupt.

In addition, if while the vCPU is blocked (while it is in guest-mode),
it receives a nested posted-interrupt, then the vCPU should also be
waken and injected with the posted interrupt.

To handle these cases, this patch enhances kvm_vcpu_has_events() to also
check if there is a pending interrupt in L2 virtual APICv provided by
L1. That is, it evaluates if there is a pending virtual interrupt for L2
by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
Evaluation of Pending Interrupts.

Note that this also handles the case of nested posted-interrupt by the
fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
kvm_vcpu_running() -> vmx_check_nested_events() ->
vmx_complete_nested_posted_interrupt().

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20 00:51:44 +02:00
Sean Christopherson
d264ee0c2e KVM: VMX: use preemption timer to force immediate VMExit
A VMX preemption timer value of '0' is guaranteed to cause a VMExit
prior to the CPU executing any instructions in the guest.  Use the
preemption timer (if it's supported) to trigger immediate VMExit
in place of the current method of sending a self-IPI.  This ensures
that pending VMExit injection to L1 occurs prior to executing any
instructions in the guest (regardless of nesting level).

When deferring VMExit injection, KVM generates an immediate VMExit
from the (possibly nested) guest by sending itself an IPI.  Because
hardware interrupts are blocked prior to VMEnter and are unblocked
(in hardware) after VMEnter, this results in taking a VMExit(INTR)
before any guest instruction is executed.  But, as this approach
relies on the IPI being received before VMEnter executes, it only
works as intended when KVM is running as L0.  Because there are no
architectural guarantees regarding when IPIs are delivered, when
running nested the INTR may "arrive" long after L2 is running e.g.
L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

For the most part, this unintended delay is not an issue since the
events being injected to L1 also do not have architectural guarantees
regarding their timing.  The notable exception is the VMX preemption
timer[1], which is architecturally guaranteed to cause a VMExit prior
to executing any instructions in the guest if the timer value is '0'
at VMEnter.  Specifically, the delay in injecting the VMExit causes
the preemption timer KVM unit test to fail when run in a nested guest.

Note: this approach is viable even on CPUs with a broken preemption
timer, as broken in this context only means the timer counts at the
wrong rate.  There are no known errata affecting timer value of '0'.

[1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Use a hook for SVM instead of leaving the default in x86.c - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20 00:51:42 +02:00
Wanpeng Li
bdf7ffc899 KVM: LAPIC: Fix pv ipis out-of-bounds access
Dan Carpenter reported that the untrusted data returns from kvm_register_read()
results in the following static checker warning:
  arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
  error: buffer underflow 'map->phys_map' 's32min-s32max'

KVM guest can easily trigger this by executing the following assembly sequence
in Ring0:

mov $10, %rax
mov $0xFFFFFFFF, %rbx
mov $0xFFFFFFFF, %rdx
mov $0, %rsi
vmcall

As this will cause KVM to execute the following code-path:
vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
which will reach out-of-bounds access.

This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id,
ignoring destinations that are not present and delivering the rest. We also check
whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the
max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm
unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-09-07 18:38:43 +02:00
Radim Krčmář
564ad0aa85 Fixes for KVM/ARM for Linux v4.19 v2:
- Fix a VFP corruption in 32-bit guest
  - Add missing cache invalidation for CoW pages
  - Two small cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbkngmAAoJEEtpOizt6ddyeaoH/15bbGHlwWf23tGjSoDzhyD4
 zAXfy+SJdm4cR8K7jEkVrNffkEMAby7Zl28hTHKB9jsY1K8DD+EuCE3Nd4kkVAsc
 iHJwV4aiHil/zC5SyE0MqMzELeS8UhsxESYebG6yNF0ElQDQ0SG+QAFr47/OBN9S
 u4I7x0rhyJP6Kg8z9U4KtEX0hM6C7VVunGWu44/xZSAecTaMuJnItCIM4UMdEkSs
 xpAoI59lwM6BWrXLvEunekAkxEXoR7AVpQER2PDINoLK2I0i0oavhPim9Xdt2ZXs
 rqQqfmwmPOVvYbexDp97JtfWo3/psGLqvgoK1tq9bzF3u6Y3ylnUK5IspyVYwuQ=
 =TK8A
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-fixes-for-v4.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm

Fixes for KVM/ARM for Linux v4.19 v2:

 - Fix a VFP corruption in 32-bit guest
 - Add missing cache invalidation for CoW pages
 - Two small cleanups
2018-09-07 18:38:25 +02:00
Marc Zyngier
a35381e10d KVM: Remove obsolete kvm_unmap_hva notifier backend
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.

Fixes: fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan <jhogan@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
2018-09-07 15:06:02 +02:00
Sean Christopherson
c60658d1d9 KVM: x86: Unexport x86_emulate_instruction()
Allowing x86_emulate_instruction() to be called directly has led to
subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
in the emulation type.  While most of the blame lies on re-execute
being opt-out, exporting x86_emulate_instruction() also exposes its
cr2 parameter, which may have contributed to commit d391f1207067
("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
when running nested") using x86_emulate_instruction() instead of
emulate_instruction() because "hey, I have a cr2!", which in turn
introduced its EMULTYPE_NO_REEXECUTE bug.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:44 +02:00
Sean Christopherson
0ce97a2b62 KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
specific function, and there's no conflict that prevents adding the
kvm_ prefix.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:44 +02:00
Sean Christopherson
384bf2218e KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE
retry_instruction() and reexecute_instruction() are a package deal,
i.e. there is no scenario where one is allowed and the other is not.
Merge their controlling emulation type flags to enforce this in code.
Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
clear that we are allowing re{try,execute} to occur, as opposed to
explicitly requesting retry of a previously failed instruction.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:43 +02:00
Sean Christopherson
8065dbd1ee KVM: x86: Invert emulation re-execute behavior to make it opt-in
Re-execution of an instruction after emulation decode failure is
intended to be used only when emulating shadow page accesses.  Invert
the flag to make allowing re-execution opt-in since that behavior is
by far in the minority.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:43 +02:00
Sean Christopherson
35be0aded7 KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation
Re-execution after an emulation decode failure is only intended to
handle a case where two or vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of RSM emulation.

Add a new helper, kvm_emulate_instruction_from_buffer(), to support
emulating from a pre-defined buffer.  This eliminates the last direct
call to x86_emulate_instruction() outside of kvm_mmu_page_fault(),
which means x86_emulate_instruction() can be unexported in a future
patch.

Fixes: 7607b7174405 ("KVM: SVM: install RSM intercept")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30 16:20:43 +02:00
Linus Torvalds
e61cf2e3a5 Minor code cleanups for PPC.
For x86 this brings in PCID emulation and CR3 caching for shadow page
 tables, nested VMX live migration, nested VMCS shadowing, an optimized
 IPI hypercall, and some optimizations.
 
 ARM will come next week.
 
 There is a semantic conflict because tip also added an .init_platform
 callback to kvm.c.  Please keep the initializer from this branch,
 and add a call to kvmclock_init (added by tip) inside kvm_init_platform
 (added here).
 
 Also, there is a backmerge from 4.18-rc6.  This is because of a
 refactoring that conflicted with a relatively late bugfix and
 resulted in a particularly hellish conflict.  Because the conflict
 was only due to unfortunate timing of the bugfix, I backmerged and
 rebased the refactoring rather than force the resolution on you.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJbdwNFAAoJEL/70l94x66DiPEH/1cAGZWGd85Y3yRu1dmTmqiz
 kZy0V+WTQ5kyJF4ZsZKKOp+xK7Qxh5e9kLdTo70uPZCHwLu9IaGKN9+dL9Jar3DR
 yLPX5bMsL8UUed9g9mlhdaNOquWi7d7BseCOnIyRTolb+cqnM5h3sle0gqXloVrS
 UQb4QogDz8+86czqR8tNfazjQRKW/D2HEGD5NDNVY1qtpY+leCDAn9/u6hUT5c6z
 EtufgyDh35UN+UQH0e2605gt3nN3nw3FiQJFwFF1bKeQ7k5ByWkuGQI68XtFVhs+
 2WfqL3ftERkKzUOy/WoSJX/C9owvhMcpAuHDGOIlFwguNGroZivOMVnACG1AI3I=
 =9Mgw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull first set of KVM updates from Paolo Bonzini:
 "PPC:
   - minor code cleanups

  x86:
   - PCID emulation and CR3 caching for shadow page tables
   - nested VMX live migration
   - nested VMCS shadowing
   - optimized IPI hypercall
   - some optimizations

  ARM will come next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
  kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
  KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
  KVM: X86: Implement PV IPIs in linux guest
  KVM: X86: Add kvm hypervisor init time platform setup callback
  KVM: X86: Implement "send IPI" hypercall
  KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
  KVM: x86: Skip pae_root shadow allocation if tdp enabled
  KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
  KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
  KVM: vmx: move struct host_state usage to struct loaded_vmcs
  KVM: vmx: compute need to reload FS/GS/LDT on demand
  KVM: nVMX: remove a misleading comment regarding vmcs02 fields
  KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
  KVM: vmx: add dedicated utility to access guest's kernel_gs_base
  KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
  KVM: vmx: refactor segmentation code in vmx_save_host_state()
  kvm: nVMX: Fix fault priority for VMX operations
  kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
  ...
2018-08-19 10:38:36 -07:00
Wanpeng Li
4180bf1b65 KVM: X86: Implement "send IPI" hypercall
Using hypercall to send IPIs by one vmexit instead of one by one for
xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster
mode. Intel guest can enter x2apic cluster mode when interrupt remmaping
is enabled in qemu, however, latest AMD EPYC still just supports xapic
mode which can get great improvement by Exit-less IPIs. This patchset
lets a guest send multicast IPIs, with at most 128 destinations per
hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.

Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):

x2apic cluster mode, vanilla

 Dry-run:                         0,            2392199 ns
 Self-IPI:                  6907514,           15027589 ns
 Normal IPI:              223910476,          251301666 ns
 Broadcast IPI:                   0,         9282161150 ns
 Broadcast lock:                  0,         8812934104 ns

x2apic cluster mode, pv-ipi

 Dry-run:                         0,            2449341 ns
 Self-IPI:                  6720360,           15028732 ns
 Normal IPI:              228643307,          255708477 ns
 Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
 Broadcast lock:                  0,         8316124651 ns

x2apic physical mode, vanilla

 Dry-run:                         0,            3135933 ns
 Self-IPI:                  8572670,           17901757 ns
 Normal IPI:              226444334,          255421709 ns
 Broadcast IPI:                   0,        19845070887 ns
 Broadcast lock:                  0,        19827383656 ns

x2apic physical mode, pv-ipi

 Dry-run:                         0,            2446381 ns
 Self-IPI:                  6788217,           15021056 ns
 Normal IPI:              219454441,          249583458 ns
 Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
 Broadcast lock:                  0,         9143618799 ns

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:20 +02:00
Tianyu Lan
b08660e59d KVM: x86: Add tlb remote flush callback in kvm_x86_ops.
This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:06 +02:00
Junaid Shahid
208320ba10 kvm: x86: Remove CR3_PCID_INVD flag
It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:02 +02:00
Junaid Shahid
b94742c958 kvm: x86: Add multi-entry LRU cache for previous CR3s
Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:02 +02:00
Junaid Shahid
faff87588d kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg*
This needs a minor bug fix. The updated patch is as follows.

Thanks,
Junaid

------------------------------------------------------------------------------

kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:01 +02:00
Junaid Shahid
08fb59d8a4 kvm: x86: Support selectively freeing either current or previous MMU root
kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
that either one of the roots (active/previous) can be individually freed
when needed.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:59 +02:00
Junaid Shahid
7eb77e9f5f kvm: x86: Add a root_hpa parameter to kvm_mmu->invlpg()
This allows invlpg() to be called using either the active root_hpa
or the prev_root_hpa.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:58 +02:00
Junaid Shahid
ade61e2824 kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest
When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.

This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:58 +02:00