IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).
This patch has no functionality changed because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional
change to vcpu->arch.regs_dirty, but no code uses regs_dirty for
VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses
regs_avail for VCPU_EXREG_CR3 dirty information.)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Not functionality changed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be
intercepted in an unclear way.
Add X86_CR4_PDPTR_BITS to make it clear and self-documented.
No functionality changed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-6-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry
via vmx_load_mmu_pgd()
But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd()
to be invoked. Normally, kvm_mmu_reset_context() is used to cause
KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped:
* commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.
* commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after
load_pdptrs() when rewriting the CR3 with the same value.
* commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.
Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_vcpu_block() directly for all wait states except HALTED so that
kvm_vcpu_halt() is no longer a misnomer on x86.
Functionally, this means KVM will never attempt halt-polling or adjust
vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or
AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(),
and x86 doesn't use any other "wait" states.
As mentioned above, the motivation of this is purely so that "halt" isn't
overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS
(and RESET_HOLD) has no meaningful effect on guest performance as there
are typically single-digit numbers of INIT-SIPI sequences per AP vCPU,
per boot, versus thousands of HLTs just to boot to console.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Go directly to kvm_vcpu_block() when handling the case where userspace
attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it
likely that halt-polling will be successful in this case.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting
the actual "block" sequences into a separate helper (to be named
kvm_vcpu_block()). x86 will use the standalone block-only path to handle
non-halt cases where the vCPU is not runnable.
Rename block_ns to halt_ns to match the new function name.
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted. Use KVM's
cached nr_memslot_pages to compute the default max number of MMU pages.
Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely
multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can
possibly overflow an unsigned long variable.
Write this "* 20 / 1000" operation as "/ 50" instead to avoid such
overflow.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[sean: use common KVM field and rework changelog accordingly]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
There is no point in calling kvm_mmu_change_mmu_pages() for memslot
operations that don't change the total page count, so do it just for
KVM_MR_CREATE and KVM_MR_DELETE.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <9e56b7616a11f5654e4ab486b3237366b7ba9f2a.1638817640.git.maciej.szmigiero@oracle.com>
Play nice with a NULL @old or @new when handling memslot updates so that
common KVM can pass NULL for one or the other in CREATE and DELETE cases
instead of having to synthesize a dummy memslot.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <2eb7788adbdc2bc9a9c5f86844dd8ee5c8428732.1638817640.git.maciej.szmigiero@oracle.com>
Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now
that its use has been removed in all architectures.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>
Get the number of pages directly from the new memslot instead of
computing the same from the userspace memory region when allocating
memslot metadata. This will allow a future patch to drop @mem.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef44892eb615f5c28e682bbe06af96aff9ce2a9f.1638817639.git.maciej.szmigiero@oracle.com>
Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary. This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.
For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY. This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.
For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.
Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.
Let's bite the bullet and repaint all of it in one go.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.
Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there is no further references on the kvm structure.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-2-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed. This triggers a WARN:
WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC. This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it. In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.
Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.
This is not a problem for performance, or if anything it is an
improvement. First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed. Second, static_call_cond will
elide the function call if APICv is not present or disabled. Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 63f5a1909f9e ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after first successful KVM_RUN and promissed to make this sequence forbiden
in 5.16. It's time to fulfil the promise.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.
Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION.
Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration")
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synchronize the two calls to kvm_x86_sync_pir_to_irr. The one
in the reenter-guest fast path invoked the callback unconditionally
even if LAPIC is present but disabled. In this case, there are
no interrupts to deliver, and therefore posted interrupts can
be ignored.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211116163443.88707-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When processing a hypercall for a guest with protected state, currently
SEV-ES guests, the guest CS segment register can't be checked to
determine if the guest is in 64-bit mode. For an SEV-ES guest, it is
expected that communication between the guest and the hypervisor is
performed to shared memory using the GHCB. In order to use the GHCB, the
guest must have been in long mode, otherwise writes by the guest to the
GHCB would be encrypted and not be able to be comprehended by the
hypervisor.
Create a new helper function, is_64_bit_hypercall(), that assumes the
guest is in 64-bit mode when the guest has protected state, and returns
true, otherwise invoking is_64_bit_mode() to determine the mode. Update
the hypercall related routines to use is_64_bit_hypercall() instead of
is_64_bit_mode().
Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurences of calls to
this helper function for a guest running with protected state.
Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fixes for Xen emulation
* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
* Fixes for migration of 32-bit nested guests on 64-bit hypervisor
* Compilation fixes
* More SEV cleanups
In 64-bit mode, x86 instruction encoding allows us to use the low 8 bits
of any GPR as an 8-bit operand. In 32-bit mode, however, we can only use
the [abcd] registers. For which, GCC has the "q" constraint instead of
the less restrictive "r".
Also fix st->preempted, which is an input/output operand rather than an
input.
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <89bf72db1b859990355f9c40713a34e0d2d86c98.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is
initialized only when Hyper-V context is available, in other path it is
just passed to kvm_x86_ops.load_eoi_exitmap() directly from on the stack,
which would cause unexpected interrupt delivery/handling issues, e.g. an
*old* linux kernel that relies on PIT to do clock calibration on KVM might
randomly fail to boot.
Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V
context is not available.
Fixes: f2bc14b69c38 ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Huang Le <huangle1@jd.com>
Message-Id: <62115b277dab49ea97da5633f8522daf@jd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When UBSAN is enabled, the code emitted for the call to guest_pv_has
includes a call to __ubsan_handle_load_invalid_value. objtool
complains that this call happens with UACCESS enabled; to avoid
the warning, pull the calls to user_access_begin into both arms
of the "if" statement, after the check for guest_pv_has.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status
* Fix selftests on APICv machines
* Fix sparse warnings
* Fix detection of KVM features in CPUID
* Cleanups for bogus writes to MSR_KVM_PV_EOI_EN
* Fixes and cleanups for MSR bitmap handling
* Cleanups for INVPCID
* Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
Add support for AMD SEV and SEV-ES intra-host migration support. Intra
host migration provides a low-cost mechanism for userspace VMM upgrades.
In the common case for intra host migration, we can rely on the normal
ioctls for passing data from one VMM to the next. SEV, SEV-ES, and other
confidential compute environments make most of this information opaque, and
render KVM ioctls such as "KVM_GET_REGS" irrelevant. As a result, we need
the ability to pass this opaque metadata from one VMM to the next. The
easiest way to do this is to leave this data in the kernel, and transfer
ownership of the metadata from one KVM VM (or vCPU) to the next. In-kernel
hand off makes it possible to move any data that would be
unsafe/impossible for the kernel to hand directly to userspace, and
cannot be reproduced using data that can be handed to userspace.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
either num_online_cpus() or num_present_cpus(), s390 has multiple
constants depending on hardware features. On x86, KVM reports an
arbitrary value of '710' which is supposed to be the maximum tested
value but it's possible to test all KVM_MAX_VCPUS even when there are
less physical CPUs available.
Drop the arbitrary '710' value and return num_online_cpus() on x86 as
well. The recommendation will match other architectures and will mean
'no CPU overcommit'.
For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
when the requested vCPU number exceeds it. The static limit of '710'
is quite weird as smaller systems with just a few physical CPUs should
certainly "recommend" less.
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle #GP on INVPCID due to an invalid type in the common switch
statement instead of relying on the callers (VMX and SVM) to manually
validate the type.
Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check
the type before reading the operand from memory, so deferring the
type validity check until after that point is architecturally allowed.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109174426.2350547-3-vipinsh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_lapic_enable_pv_eoi() is a misnomer as the function is also
used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi().
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211108152819.12485-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using
standard kvm's inject_pending_event, and not via APICv/AVIC.
Since this is a debug feature, just inhibit APICv/AVIC while
KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
Fixes: 61e5f69ef0837 ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These function names sound like predicates, and they have siblings,
*is_valid_msr(), which _are_ predicates. Moreover, there are comments
that essentially warn that these functions behave unexpectedly.
Flip the polarity of the return values, so that they become
predicates, and convert the boolean result to a success/failure code
at the outer call site.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211105202058.1048757-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
not missed") we switched to using a gfn_to_pfn_cache for accessing the
guest steal time structure in order to allow for an atomic xchg of the
preempted field. This has a couple of problems.
Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
guest vCPU using an IOMEM page for its steal time would never have its
preempted field set.
Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
should have been. There are two stages to the GFN->PFN conversion;
first the GFN is converted to a userspace HVA, and then that HVA is
looked up in the process page tables to find the underlying host PFN.
Correct invalidation of the latter would require being hooked up to the
MMU notifiers, but that doesn't happen---so it just keeps mapping and
unmapping the *wrong* PFN after the userspace page tables change.
In the !IOMEM case at least the stale page *is* pinned all the time it's
cached, so it won't be freed and reused by anyone else while still
receiving the steal time updates. The map/unmap dance only takes care
of the KVM administrivia such as marking the page dirty.
Until the gfn_to_pfn cache handles the remapping automatically by
integrating with the MMU notifiers, we might as well not get a
kernel mapping of it, and use the perfectly serviceable userspace HVA
that we already have. We just need to implement the atomic xchg on
the userspace address with appropriate exception handling, which is
fairly trivial.
Cc: stable@vger.kernel.org
Fixes: b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
[I didn't entirely agree with David's assessment of the
usefulness of the gfn_to_pfn cache, and integrated the outcome
of the discussion in the above commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV to work with intra host migration, contents of the SEV info struct
such as the ASID (used to index the encryption key in the AMD SP) and
the list of memory regions need to be transferred to the target VM.
This change adds a commands for a target VMM to get a source SEV VM's sev
info.
Signed-off-by: Peter Gonda <pgonda@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <20211021174303.385706-3-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases
where it is by design that the VM cannot be operated upon. In this
case any KVM_BUG_ON should still warn, so introduce a new flag
kvm->vm_dead that is separate from kvm->vm_bugged.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* More progress on the protected VM front, now with the full
fixed feature set as well as the limitation of some hypercalls
after initialisation.
* Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
* Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
* More memcg accounting of the memory allocated on behalf of a guest
* Timer and vgic selftests
* Workarounds for the Apple M1 broken vgic implementation
* KConfig cleanups
* New kvmarm.mode=none option, for those who really dislike us
RISC-V:
* New KVM port.
x86:
* New API to control TSC offset from userspace
* TSC scaling for nested hypervisors on SVM
* Switch masterclock protection from raw_spin_lock to seqcount
* Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
* Convey the exit reason to userspace on emulation failure
* Configure time between NX page recovery iterations
* Expose Predictive Store Forwarding Disable CPUID leaf
* Allocate page tracking data structures lazily (if the i915
KVM-GT functionality is not compiled in)
* Cleanups, fixes and optimizations for the shadow MMU code
s390:
* SIGP Fixes
* initial preparations for lazy destroy of secure VMs
* storage key improvements/fixes
* Log the guest CPNC
Starting from this release, KVM-PPC patches will come from
Michael Ellerman's PPC tree.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGBOiEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNowwf/axlx3g9sgCwQHr12/6UF/7hL/RwP
9z+pGiUzjl2YQE+RjSvLqyd6zXh+h4dOdOKbZDLSkSTbcral/8U70ojKnQsXM0XM
1LoymxBTJqkgQBLm9LjYreEbzrPV4irk4ygEmuk3CPOHZu8xX1ei6c5LdandtM/n
XVUkXsQY+STkmnGv4P3GcPoDththCr0tBTWrFWtxa0w9hYOxx0ay1AZFlgM4FFX0
QFuRc8VBLoDJpIUjbkhsIRIbrlHc/YDGjuYnAU7lV/CIME8vf2BW6uBwIZJdYcDj
0ejozLjodEnuKXQGnc8sXFioLX2gbMyQJEvwCgRvUu/EU7ncFm1lfs7THQ==
=UxKM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from explicit
error codes to a boolean fail/success as that's all what the calling
code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX support:
- Distangle the public header maze and remove especially the misnomed
kitchen sink internal.h which is despite it's name included all over
the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime by
flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code into
the FPU core which removes the number of exports and avoids adding
even more export when AMX has to be supported in KVM. This also
removes duplicated code which was of course unnecessary different and
incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new fpstate
container and just switching the buffer pointer from the user space
buffer to the KVM guest buffer when entering vcpu_run() and flipping
it back when leaving the function. This cuts the memory requirements
of a vCPU for FPU buffers in half and avoids pointless memory copy
operations.
This also solves the so far unresolved problem of adding AMX support
because the current FPU buffer handling of KVM inflicted a circular
dependency between adding AMX support to the core and to KVM. With
the new scheme of switching fpstate AMX support can be added to the
core code without affecting KVM.
- Replace various variables with proper data structures so the extra
information required for adding dynamically enabled FPU features (AMX)
can be added in one place
- Add AMX (Advanved Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD)
which allows to trap the (first) use of an AMX related instruction,
which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra 8K
or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and cleared
on exec(). The permission policy of the kernel is restricted to
sigaltstack size validation, but the syscall obviously allows
further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
takes granted permissions and the potentially resulting larger
signal frame into account. This mechanism can also be used to
enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support was
added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the use
of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have been
disabled in XCR0. If permission has been granted, then a new fpstate
which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler sends
SIGSEGV to the task. That's not elegant, but unavoidable as the
other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused by
unexpected memory allocation failures is not a fundamentally new
concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is disarmed
for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with the
same life time rules as the FPU register state itself. The mechanism
is keyed off with a static key which is default disabled so !AMX
equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
is limited by comparing the tasks XFD value with a per CPU shadow
variable to avoid redundant MSR writes. In case of switching from a
AMX using task to a non AMX using task or vice versa, the extra MSR
write is obviously inevitable.
All other places which need to be aware of the variable feature sets
and resulting variable sizes are not affected at all because they
retrieve the information (feature set, sizes) unconditonally from
the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support is in
the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which has
not been caught in review and testing right away was restricted to AMX
enabled systems, which is completely irrelevant for anyone outside Intel
and their early access program. There might be dragons lurking as usual,
but so far the fine grained refactoring has held up and eventual yet
undetected fallout is bisectable and should be easily addressable before
the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity to
follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for inclusion
into 5.16-rc1.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a
/3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3
YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU
jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej
jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT
EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN
RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY
YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC
dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8
FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL
75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T
hcKvDmehQLrUvg==
=x3WL
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success as that's all what the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Distangle the public header maze and remove especially the
misnomed kitchen sink internal.h which is despite it's name
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime
by flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code
into the FPU core which removes the number of exports and avoids
adding even more export when AMX has to be supported in KVM.
This also removes duplicated code which was of course
unnecessary different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows to trap the (first) use of an AMX related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the
use of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have
been disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same life time rules as the FPU register state itself. The
mechanism is keyed off with a static key which is default
disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the tasks XFD value
with a per CPU shadow variable to avoid redundant MSR writes. In
case of switching from a AMX using task to a non AMX using task
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all because
they retrieve the information (feature set, sizes) unconditonally
from the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support
is in the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine grained refactoring has held up
and eventual yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
* Fixes for Xen emulator bugs showing up as debug kernel WARNs
* Fix another issue with SEV/ES string I/O VMGEXITs
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmF6uGIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNRagf/Srvk9lNcRh4cEzsczErKMyr3xOqA
jgsTSqgl1ExJI9sBLMpVYBOFGILMaMSrhLPIltKPy0Bj/E+hw8WOQwPa44QjWlSD
MAUxO1Nryt9Luc2L8uSd1c//g4fr4V1BhOaumk1lM14Q8EDfQBcDIMI2ZKueMU1+
2Q+n8/AsG63jQIINwKNidof0dzRtbfcE30Wq/8QHttIPo5wt6l0YClOlOikqNY8N
5+WSQFmuutHIXftq5Jb/Ldn/+HVukWZyZOEVwLnBpM9uBvIubNgcEakqvxsaVtAn
FHdvnA+Bk99/Xuhl+wRLQo8ofzQIQ13RQv3HPArJAJv34oAJZx2rNObVlA==
=6ofB
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Fixes for s390 interrupt delivery
- Fixes for Xen emulator bugs showing up as debug kernel WARNs
- Fix another issue with SEV/ES string I/O VMGEXITs
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Take srcu lock in post_kvm_run_save()
KVM: SEV-ES: fix another issue with string I/O VMGEXITs
KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block()
KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock
KVM: s390: preserve deliverable_mask in __airqs_kick_single_vcpu
KVM: s390: clear kicked_mask before sleeping again
Should instruction emulation fail, include the VM exit reason, etc. in
the emulation_failure data passed to userspace, in order that the VMM
can report it as a debugging aid when describing the failure.
Suggested-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the upcoming AMX support it's necessary to do a proper integration with
KVM. Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current:🧵:fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Convert the KVM FPU code over to this new scheme.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185313.019454292@linutronix.de
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that
vCPUs responding to requests can take the lock for read and run in
parallel. Using a mutex forces serialization of vCPUs even though
kvm_vcpu_update_apicv() only touches data local to that vCPU or is
protected by a different lock, e.g. SVM's ir_list_lock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move SVM's assertion that vCPU's APICv state is consistent with its VM's
state out of svm_vcpu_run() and into x86's common inner run loop. The
assertion and underlying logic is not unique to SVM, it's just that SVM
has more inhibiting conditions and thus is more likely to run headfirst
into any KVM bugs.
Add relevant comments to document exactly why the update path has unusual
ordering between the update the kick, why said ordering is safe, and also
the basic rules behind the assertion in the run loop.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PIO scratch buffer is larger than a single page, and therefore
it is not possible to copy it in a single step to vcpu->arch/pio_data.
Bound each call to emulator_pio_in/out to a single page; keep
track of how many I/O operations are left in vcpu->arch.sev_pio_count,
so that the operation can be restarted in the complete_userspace_io
callback.
For OUT, this means that the previous kvm_sev_es_outs implementation
becomes an iterator of the loop, and we can consume the sev_pio_data
buffer before leaving to userspace.
For IN, instead, consuming the buffer and decreasing sev_pio_count
is always done in the complete_userspace_io callback, because that
is when the memcpy is done into sev_pio_data.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the diff a little nicer when we actually get to fixing
the bug. No functional change intended.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in,
and therefore does not need the size and count arguments. This makes things
nicer when the function is called directly from a complete_userspace_io
callback.
No functional change intended.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>