linux

iv/linux

Author	SHA1	Message	Date
David Woodhouse	f3d1436d4b	KVM: x86: Take srcu lock in post_kvm_run_save() The Xen interrupt injection for event channels relies on accessing the guest's vcpu_info structure in __kvm_xen_has_interrupt(), through a gfn_to_hva_cache. This requires the srcu lock to be held, which is mostly the case except for this code path: [ 11.822877] WARNING: suspicious RCU usage [ 11.822965] ----------------------------- [ 11.823013] include/linux/kvm_host.h:664 suspicious rcu_dereference_check() usage! [ 11.823131] [ 11.823131] other info that might help us debug this: [ 11.823131] [ 11.823196] [ 11.823196] rcu_scheduler_active = 2, debug_locks = 1 [ 11.823253] 1 lock held by dom:0/90: [ 11.823292] #0: ffff998956ec8118 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x680 [ 11.823379] [ 11.823379] stack backtrace: [ 11.823428] CPU: 2 PID: 90 Comm: dom:0 Kdump: loaded Not tainted 5.4.34+ #5 [ 11.823496] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.823612] Call Trace: [ 11.823645] dump_stack+0x7a/0xa5 [ 11.823681] lockdep_rcu_suspicious+0xc5/0x100 [ 11.823726] __kvm_xen_has_interrupt+0x179/0x190 [ 11.823773] kvm_cpu_has_extint+0x6d/0x90 [ 11.823813] kvm_cpu_accept_dm_intr+0xd/0x40 [ 11.823853] kvm_vcpu_ready_for_interrupt_injection+0x20/0x30 < post_kvm_run_save() inlined here > [ 11.823906] kvm_arch_vcpu_ioctl_run+0x135/0x6a0 [ 11.823947] kvm_vcpu_ioctl+0x263/0x680 Fixes: 40da8ccd724f ("KVM: x86/xen: Add event channel interrupt vector upcall") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Message-Id: <606aaaf29fca3850a63aa4499826104e77a72346.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-28 10:45:38 -04:00
David Woodhouse	8228c77d8b	KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock On the preemption path when updating a Xen guest's runstate times, this lock is taken inside the scheduler rq->lock, which is a raw spinlock. This was shown in a lockdep warning: [ 89.138354] ============================= [ 89.138356] [ BUG: Invalid wait context ] [ 89.138358] 5.15.0-rc5+ #834 Tainted: G S I E [ 89.138360] ----------------------------- [ 89.138361] xen_shinfo_test/2575 is trying to lock: [ 89.138363] ffffa34a0364efd8 (&kvm->arch.pvclock_gtod_sync_lock){....}-{3:3}, at: get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138442] other info that might help us debug this: [ 89.138444] context-{5:5} [ 89.138445] 4 locks held by xen_shinfo_test/2575: [ 89.138447] #0: ffff972bdc3b8108 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x6f0 [kvm] [ 89.138483] #1: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_ioctl_run+0xdc/0x8b0 [kvm] [ 89.138526] #2: ffff97331fdbac98 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xff/0xbd0 [ 89.138534] #3: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x26/0x170 [kvm] ... [ 89.138695] get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138734] kvm_xen_update_runstate+0x14/0x90 [kvm] [ 89.138783] kvm_xen_update_runstate_guest+0x15/0xd0 [kvm] [ 89.138830] kvm_arch_vcpu_put+0xe6/0x170 [kvm] [ 89.138870] kvm_sched_out+0x2f/0x40 [kvm] [ 89.138900] __schedule+0x5de/0xbd0 Cc: stable@vger.kernel.org Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <1b02a06421c17993df337493a68ba923f3bd5c0f.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 08:14:38 -04:00
David Edmondson	e615e35589	KVM: x86: On emulation failure, convey the exit reason, etc. to userspace Should instruction emulation fail, include the VM exit reason, etc. in the emulation_failure data passed to userspace, in order that the VMM can report it as a debugging aid when describing the failure. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210920103737.2696756-4-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 06:48:24 -04:00
Thomas Gleixner	d69c1382e1	x86/kvm: Convert FPU handling to a single swap buffer For the upcoming AMX support it's necessary to do a proper integration with KVM. Currently KVM allocates two FPU structs which are used for saving the user state of the vCPU thread and restoring the guest state when entering vcpu_run() and doing the reverse operation before leaving vcpu_run(). With the new fpstate mechanism this can be reduced to one extra buffer by swapping the fpstate pointer in current:🧵:fpu. This makes the upcoming support for AMX and XFD simpler because then fpstate information (features, sizes, xfd) are always consistent and it does not require any nasty workarounds. Convert the KVM FPU code over to this new scheme. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211022185313.019454292@linutronix.de	2021-10-23 16:13:29 +02:00
Sean Christopherson	187c8833de	KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism Use a rw_semaphore instead of a mutex to coordinate APICv updates so that vCPUs responding to requests can take the lock for read and run in parallel. Using a mutex forces serialization of vCPUs even though kvm_vcpu_update_apicv() only touches data local to that vCPU or is protected by a different lock, e.g. SVM's ir_list_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Sean Christopherson	ee49a89329	KVM: x86: Move SVM's APICv sanity check to common x86 Move SVM's assertion that vCPU's APICv state is consistent with its VM's state out of svm_vcpu_run() and into x86's common inner run loop. The assertion and underlying logic is not unique to SVM, it's just that SVM has more inhibiting conditions and thus is more likely to run headfirst into any KVM bugs. Add relevant comments to document exactly why the update path has unusual ordering between the update the kick, why said ordering is safe, and also the basic rules behind the assertion in the run loop. Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Paolo Bonzini	95e16b4792	KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed The PIO scratch buffer is larger than a single page, and therefore it is not possible to copy it in a single step to vcpu->arch/pio_data. Bound each call to emulator_pio_in/out to a single page; keep track of how many I/O operations are left in vcpu->arch.sev_pio_count, so that the operation can be restarted in the complete_userspace_io callback. For OUT, this means that the previous kvm_sev_es_outs implementation becomes an iterator of the loop, and we can consume the sev_pio_data buffer before leaving to userspace. For IN, instead, consuming the buffer and decreasing sev_pio_count is always done in the complete_userspace_io callback, because that is when the memcpy is done into sev_pio_data. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reported-by: Felix Wilhelm <fwilhelm@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:09:13 -04:00
Paolo Bonzini	4fa4b38dae	KVM: SEV-ES: keep INS functions together Make the diff a little nicer when we actually get to fixing the bug. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:51 -04:00
Paolo Bonzini	6b5efc930b	KVM: x86: remove unnecessary arguments from complete_emulator_pio_in complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in, and therefore does not need the size and count arguments. This makes things nicer when the function is called directly from a complete_userspace_io callback. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:38 -04:00
Paolo Bonzini	3b27de2718	KVM: x86: split the two parts of emulator_pio_in emulator_pio_in handles both the case where the data is pending in vcpu->arch.pio.count, and the case where I/O has to be done via either an in-kernel device or a userspace exit. For SEV-ES we would like to split these, to identify clearly the moment at which the sev_pio_data is consumed. To this end, create two different functions: __emulator_pio_in fills in vcpu->arch.pio.count, while complete_emulator_pio_in clears it and releases vcpu->arch.pio.data. Because this patch has to be backported, things are left a bit messy. kernel_pio() operates on vcpu->arch.pio, which leads to emulator_pio_in() having with two calls to complete_emulator_pio_in(). It will be fixed in the next release. While at it, remove the unused void* val argument of emulator_pio_in_out. The function currently hardcodes vcpu->arch.pio_data as the source/destination buffer, which sucks but will be fixed after the more severe SEV-ES buffer overflow. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:00 -04:00
Paolo Bonzini	ea724ea420	KVM: SEV-ES: clean up kvm_sev_es_ins/outs A few very small cleanups to the functions, smushed together because the patch is already very small like this: - inline emulator_pio_in_emulated and emulator_pio_out_emulated, since we already have the vCPU - remove the data argument and pull setting vcpu->arch.sev_pio_data into the caller - remove unnecessary clearing of vcpu->arch.pio.count when emulation is done by the kernel (and therefore vcpu->arch.pio.count is already clear on exit from emulator_pio_in and emulator_pio_out). No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:02:20 -04:00
Paolo Bonzini	0d33b1baeb	KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out Currently emulator_pio_in clears vcpu->arch.pio.count twice if emulator_pio_in_out performs kernel PIO. Move the clear into emulator_pio_out where it is actually necessary. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:02:07 -04:00
Paolo Bonzini	b5998402e3	KVM: SEV-ES: rename guest_ins_data to sev_pio_data We will be using this field for OUTS emulation as well, in case the data that is pushed via OUTS spans more than one page. In that case, there will be a need to save the data pointer across exits to userspace. So, change the name to something that refers to any kind of PIO. Also spell out what it is used for, namely SEV-ES. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:01:26 -04:00
Lai Jiangshan	61b05a9fd4	KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() kvm_mmu_unload() destroys all the PGD caches. Use the lighter kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:44:43 -04:00
Lai Jiangshan	509bfe3d97	KVM: X86: Cache CR3 in prev_roots when PCID is disabled The commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific PCID and in the case of PCID is disabled, it includes all PGDs in the prev_roots and the commit made prev_roots totally unused in this case. Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1 before the said commit: (CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest RIP is global; cr3_b is cached in prev_roots) modify page tables under cr3_b the shadow root of cr3_b is unsync in kvm INVPCID single context the guest expects the TLB is clean for PCID=0 change CR4.PCIDE 0 -> 1 switch to cr3_b with PCID=0,NOFLUSH=1 No sync in kvm, cr3_b is still unsync in kvm jump to the page that was modified in step 1 shadow page tables point to the wrong page It is a very unlikely case, but it shows that stale prev_roots can be a problem after CR4.PCIDE changes from 0 to 1. However, to fix this case, the commit disabled caching CR3 in prev_roots altogether when PCID is disabled. Not all CPUs have PCID; especially the PCID support for AMD CPUs is kind of recent. To restore the prev_roots optimization for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when CR4.PCIDE changes. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	e45e9e3998	KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() The KVM doesn't know whether any TLB for a specific pcid is cached in the CPU when tdp is enabled. So it is better to flush all the guest TLB when invalidating any single PCID context. The case is very rare or even impossible since KVM generally doesn't intercept CR3 write or INVPCID instructions when tdp is enabled, so the fix is mostly for the sake of overall robustness. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	a91a7c7096	KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS while kvm_mmu_role doesn't use X86_CR4_PGE. So X86_CR4_PGE is also removed from KVM_MMU_CR4_ROLE_BITS. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Lai Jiangshan	552617382c	KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0 X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Sean Christopherson	9dadfc4a61	KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages Paul pointed out the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel and kvm_amd, and use the name for relevant error message when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
David Stevens	1e76a3ce0d	KVM: cleanup allocation of rmaps and page tracking data Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:25 -04:00
Paolo Bonzini	de7cd3f676	KVM: x86: check for interrupts before deciding whether to exit the fast path The kvm_x86_sync_pir_to_irr callback can sometimes set KVM_REQ_EVENT. If that happens exactly at the time that an exit is handled as EXIT_FASTPATH_REENTER_GUEST, vcpu_enter_guest will go incorrectly through the loop that calls kvm_x86_run, instead of processing the request promptly. Fixes: 379a3c8ee444 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-21 03:35:41 -04:00
Thomas Gleixner	1c57572d75	x86/KVM: Convert to fpstate Convert KVM code to the new register storage mechanism in preparation for dynamically sized buffers. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211013145322.451439983@linutronix.de	2021-10-20 22:34:14 +02:00
Thomas Gleixner	087df48c29	x86/fpu: Replace KVMs xstate component clearing In order to prepare for the support of dynamically enabled FPU features, move the clearing of xstate components to the FPU core code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211013145322.399567049@linutronix.de	2021-10-20 22:26:41 +02:00
Thomas Gleixner	bf5d004707	x86/fpu: Replace KVMs home brewed FPU copy to user Similar to the copy from user function the FPU core has this already implemented with all bells and whistles. Get rid of the duplicated code and use the core functionality. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.244101845@linutronix.de	2021-10-20 22:17:17 +02:00
Thomas Gleixner	ea4d6938d4	x86/fpu: Replace KVMs home brewed FPU copy from user Copying a user space buffer to the memory buffer is already available in the FPU core. The copy mechanism in KVM lacks sanity checks and needs to use cpuid() to lookup the offset of each component, while the FPU core has this information cached. Make the FPU core variant accessible for KVM and replace the home brewed mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.134065207@linutronix.de	2021-10-20 15:27:27 +02:00
Thomas Gleixner	a0ff0611c2	x86/fpu: Move KVMs FPU swapping to FPU core Swapping the host/guest FPU is directly fiddling with FPU internals which requires 5 exports. The upcoming support of dynamically enabled states would even need more. Implement a swap function in the FPU core code and export that instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de	2021-10-20 15:27:27 +02:00
Thomas Gleixner	126fe04018	x86/fpu: Cleanup xstate xcomp_bv initialization No point in having this duplicated all over the place with needlessly different defines. Provide a proper initialization function which initializes user buffers properly and make KVM use it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011538.897664678@linutronix.de	2021-10-20 15:27:26 +02:00
Oliver Upton	828ca89628	KVM: x86: Expose TSC offset controls to userspace To date, VMM-directed TSC synchronization and migration has been a bit messy. KVM has some baked-in heuristics around TSC writes to infer if the VMM is attempting to synchronize. This is problematic, as it depends on host userspace writing to the guest's TSC within 1 second of the last write. A much cleaner approach to configuring the guest's views of the TSC is to simply migrate the TSC offset for every vCPU. Offsets are idempotent, and thus not subject to change depending on when the VMM actually reads/writes values from/to KVM. The VMM can then read the TSC once with KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when the guest is paused. Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-8-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Oliver Upton	58d4277be9	KVM: x86: Refactor tsc synchronization code Refactor kvm_synchronize_tsc to make a new function that allows callers to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly for the sake of participating in TSC synchronization. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-7-oupton@google.com> [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by Maxim Levitsky. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Paolo Bonzini	869b44211a	kvm: x86: protect masterclock with a seqcount Protect the reference point for kvmclock with a seqcount, so that kvmclock updates for all vCPUs can proceed in parallel. Xen runstate updates will also run in parallel and not bounce the kvmclock cacheline. Of the variables that were protected by pvclock_gtod_sync_lock, nr_vcpus_matched_tsc is different because it is updated outside pvclock_update_vm_gtod_copy and read inside it. Therefore, we need to keep it protected by a spinlock. In fact it must now be a raw spinlock, because pvclock_update_vm_gtod_copy, being the write-side of a seqcount, is non-preemptible. Since we already have tsc_write_lock which is a raw spinlock, we can just use tsc_write_lock as the lock that protects the write-side of the seqcount. Co-developed-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-6-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Oliver Upton	c68dc1b577	KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Handling the migration of TSCs correctly is difficult, in part because Linux does not provide userspace with the ability to retrieve a (TSC, realtime) clock pair for a single instant in time. In lieu of a more convenient facility, KVM can report similar information in the kvm_clock structure. Provide userspace with a host TSC & realtime pair iff the realtime clock is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid realtime value, advance the KVM clock by the amount of elapsed time. Do not step the KVM clock backwards, though, as it is a monotonic oscillator. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Paolo Bonzini	a25c78d04c	Merge commit 'kvm-pagedata-alloc-fixes' into HEAD	2021-10-18 14:13:37 -04:00
Paolo Bonzini	fa13843d15	KVM: X86: fix lazy allocation of rmaps If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation must be skipped lest KVM will overwrite the previous pointer and will indeed leak memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:17 -04:00
David Stevens	deae4a10f1	KVM: x86: only allocate gfn_track when necessary Avoid allocating the gfn_track arrays if nothing needs them. If there are no external to KVM users of the API (i.e. no GVT-g), then page tracking is only needed for shadow page tables. This means that when tdp is enabled and there are no external users, then the gfn_track arrays can be lazily allocated when the shadow MMU is actually used. This avoid allocations equal to .05% of guest memory when nested virtualization is not used, if the kernel is compiled without GVT-g. Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-3-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:58 -04:00
Juergen Gross	78b497f2e6	kvm: use kvfree() in kvm_arch_free_vm() By switching from kfree() to kvfree() in kvm_arch_free_vm() Arm64 can use the common variant. This can be accomplished by adding another macro __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now. Further simplification can be achieved by adding __kvm_arch_free_vm() doing the common part. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-Id: <20210903130808.30142-5-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:57 -04:00
Oliver Upton	55c0cefbdb	KVM: x86: Fix potential race in KVM_GET_CLOCK Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock outside of the pvclock sync lock. This is problematic, as the clock value written to the user may or may not actually correspond to a stable TSC. Fix the race by populating the entire kvm_clock_data structure behind the pvclock_gtod_sync_lock. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-4-oupton@google.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Paolo Bonzini	45e6c2fac0	KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions No functional change intended. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Paolo Bonzini	6b6fcd2804	kvm: x86: abstract locking around pvclock_update_vm_gtod_copy Updates to the kvmclock parameters needs to do a complicated dance of KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking pvclock_gtod_sync_lock. Place that in two functions that can be called on all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Maxim Levitsky	5228eb96a4	KVM: x86: nSVM: implement nested TSC scaling This was tested by booting a nested guest with TSC=1Ghz, observing the clocks, and doing about 100 cycles of migration. Note that qemu patch is needed to support migration because of a new MSR that needs to be placed in the migration state. The patch will be sent to the qemu mailing list soon. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-14-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:46 -04:00
Longpeng(Mike)	515a0c79e7	kvm: irqfd: avoid update unmodified entries of the routing All of the irqfds would to be updated when update the irq routing, it's too expensive if there're too many irqfds. However we can reduce the cost by avoid some unnecessary updates. For irqs of MSI type on X86, the update can be saved if the msi values are not change. The vfio migration could receives benefit from this optimi- zaiton. The test VM has 128 vcpus and 8 VF (with 65 vectors enabled), so the VM has more than 520 irqfds. We mesure the cost of the vfio_msix_enable (in QEMU, it would set routing for each irqfd) for each VF, and we can see the total cost can be significantly reduced. Origin Apply this Patch 1st 8 4 2nd 15 5 3rd 22 6 4th 24 6 5th 36 7 6th 44 7 7th 51 8 8th 58 8 Total 258ms 51ms We're also tring to optimize the QEMU part [1], but it's still worth to optimize the KVM to gain more benefits. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Message-Id: <20210827080003.2689-1-longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:10 -04:00
Sean Christopherson	25b9784586	KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT Manually look for a CPUID.0x1 entry instead of bouncing through kvm_cpuid() when retrieving the Family-Model-Stepping information for vCPU RESET/INIT. This fixes a potential undefined behavior bug due to kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_, a.k.a. the index. A more minimal fix would be to simply zero "dummy", but the extra work in kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR. Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as holding the CPU's FMS information, not as holding CPUID.0x1.EAX. KVM's usage of CPUID entries to get FMS is simply a pragmatic approach to avoid having yet another way for userspace to provide inconsistent data. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210929222426.1855730-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:08 -04:00
Sean Christopherson	62dd57dd67	KVM: x86: WARN on non-zero CRs at RESET to detect improper initalization WARN if CR0, CR3, or CR4 are non-zero at RESET, which given the current KVM implementation, really means WARN if they're not zeroed at vCPU creation. VMX in particular has several ->set_*() flows that read other registers to handle side effects, and because those flows are common to RESET and INIT, KVM subtly relies on emulated/virtualized registers to be zeroed at vCPU creation in order to do the right thing at RESET. Use CRs as a sentinel because they are most likely to be written as side effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0' to correctly reflect the state of the vCPU's MMU. CRs are also loaded and stored from/to the VMCS, and so adds some level of coverage to verify that KVM doesn't conflate zero-allocating the VMCS with properly initializing the VMCS with VMWRITEs. Note, '0' is somewhat arbitrary, vCPU creation can technically stuff any value for a register so long as it's coherent with respect to the current vCPU state. In practice, '0' works for all registers and is convenient. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:07 -04:00
Sean Christopherson	583d369b36	KVM: x86: Fold fx_init() into kvm_arch_vcpu_create() Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(), dropping the superfluous check on vcpu->arch.guest_fpu that was blindly and wrongly added by commit ed02b213098a ("KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest"). Note, KVM currently allocates and then frees FPU state for SEV-ES guests, rather than avoid the allocation in the first place. While that approach is inarguably inefficient and unnecessary, it's a cleanup for the future. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	e8f65b9bb4	KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since commit a554d207dc46 ("KVM: X86: Processor States following Reset or INIT"). Back when XCR0 support was added by commit 2acf923e38fb ("KVM: VMX: Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET and INIT. Ignoring the fact that calling fx_init() for INIT is obviously wrong, e.g. FPU state after INIT is not the same as after RESET, setting XCR0 in fx_init() was correct. Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU creation (ignore the terrible name) by commit 0ee6a5172573 ("x86/fpu, kvm: Simplify fx_init()"). Finally, commit 95a0d01eef7a ("KVM: x86: Move all vcpu init code into kvm_arch_vcpu_create()") killed off kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of guest state during vCPU creation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	5ebbc470d7	KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create Drop code to set CR0.ET for the guest during initialization of the guest FPU. The code was added as a misguided bug fix by commit 380102c8e431 ("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM systems. While init_vmcb() did set CR0.ET, it only did so in the VMCB, and subtly did not update vcpu->cr0. Stuffing CR0.ET worked around the immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB being out of sync. That underlying bug was eventually remedied by commit 18fa000ae453 ("KVM: SVM: Reset cr0 properly on vcpu reset"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	ff8828c84f	KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT Do not blindly mark all registers as available+dirty at RESET/INIT, and instead rely on writes to registers to go through the proper mutators or to explicitly mark registers as dirty. INIT in particular does not blindly overwrite all registers, e.g. select bits in CR0 are preserved across INIT, thus marking registers available+dirty without first reading the register from hardware is incorrect. In practice this is a benign bug as KVM doesn't let the guest control CR0 bits that are preserved across INIT, and all other true registers are explicitly written during the RESET/INIT flows. The PDPTRs and EX_INFO "registers" are not explicitly written, but accessing those values during RESET/INIT is nonsensical and would be a KVM bug regardless of register caching. Fixes: 66f7b72e1171 ("KVM: x86: Make register state after reset conform to specification") [sean: !!! NOT FOR STABLE !!!] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Sean Christopherson	94c641ba7a	KVM: x86: Simplify retrieving the page offset when loading PDTPRs Replace impressively complex "logic" for computing the page offset from CR3 when loading PDPTRs. Unlike other paging modes, the address held in CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits 11:5 need to be used as the offset from the gfn when reading PDPTRs. The existing calculation originated in commit 1342d3536d6a ("[PATCH] KVM: MMU: Load the pae pdptrs on cr3 change like the processor does"), which read the PDPTRs from guest memory as individual 8-byte loads. At the time, the so called "offset" was the base index of PDPTR0 as a _u64_, not a byte offset. Naming aside, the computation was useful and arguably simplified the overall flow. Unfortunately, when commit 195aefde9cc2 ("KVM: Add general accessors to read and write guest memory") added accessors with offsets at byte granularity, the cleverness of the original code was lost and KVM was left with convoluted code for a simple operation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Sean Christopherson	15cabbc259	KVM: x86: Subsume nested GPA read helper into load_pdptrs() Open code the call to mmu->translate_gpa() when loading nested PDPTRs and kill off the existing helper, kvm_read_guest_page_mmu(), to discourage incorrect use. Reading guest memory straight from an L2 GPA is extremely rare (as evidenced by the lack of users), as very few constructs in x86 specify physical addresses, even fewer are virtualized by KVM, and even fewer yet require emulation of L2 by L0 KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Juergen Gross	a1c42ddedf	kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the number of allowed vcpu-ids. This has already led to confusion, so rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more clear Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Vitaly Kuznetsov	620b2438ab	KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask kvm_make_vcpus_request_mask() already disables preemption so just like kvm_make_all_cpus_request_except() it can be switched to using pre-allocated per-cpu cpumasks. This allows for improvements for both users of the function: in Hyper-V emulation code 'tlb_flush' can now be dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask() gets rid of dynamic allocation. cpumask_available() checks in kvm_make_vcpu_request() and kvm_kick_many_cpus() can now be dropped as they checks for an impossible condition: kvm_init() makes sure per-cpu masks are allocated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:04 -04:00

1 2 3 4 5 ...

2380 Commits