43943 Commits

Author SHA1 Message Date
Alexey Dobriyan
70e79866ab ELF: fix all "Elf" typos
ELF is acronym and therefore should be spelled in all caps.

I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like
being written in the first person.

Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-08 13:45:37 -07:00
Alyssa Ross
d83806c4c0 purgatory: fix disabling debug info
Since 32ef9e5054ec, -Wa,-gdwarf-2 is no longer used in KBUILD_AFLAGS.
Instead, it includes -g, the appropriate -gdwarf-* flag, and also the
-Wa versions of both of those if building with Clang and GNU as.  As a
result, debug info was being generated for the purgatory objects, even
though the intention was that it not be.

Fixes: 32ef9e5054ec ("Makefile.debug: re-enable debug info for .S files")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Cc: stable@vger.kernel.org
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-04-08 19:36:53 +09:00
Aaron Lewis
dfdeda67ea KVM: x86/pmu: Prevent the PMU from counting disallowed events
When counting "Instructions Retired" (0xc0) in a guest, KVM will
occasionally increment the PMU counter regardless of if that event is
being filtered. This is because some PMU events are incremented via
kvm_pmu_trigger_event(), which doesn't know about the event filter. Add
the event filter to kvm_pmu_trigger_event(), so events that are
disallowed do not increment their counters.

Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230307141400.1486314-2-aaronlewis@google.com
[sean: prepend "pmc" to the new function]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07 09:24:16 -07:00
Like Xu
4fa5843d81 KVM: x86/pmu: Fix a typo in kvm_pmu_request_counter_reprogam()
Fix a "reprogam" => "reprogram" typo in kvm_pmu_request_counter_reprogam().

Fixes: 68fb4757e867 ("KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()")
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230310113349.31799-1-likexu@tencent.com
[sean: trim the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07 09:07:41 -07:00
Uros Bizjak
f96fb2df3e x86/apic: Fix atomic update of offset in reserve_eilvt_offset()
The detection of atomic update failure in reserve_eilvt_offset() is
not correct. The value returned by atomic_cmpxchg() should be compared
to the old value from the location to be updated.

If these two are the same, then atomic update succeeded and
"eilvt_offsets[offset]" location is updated to "new" in an atomic way.

Otherwise, the atomic update failed and it should be retried with the
value from "eilvt_offsets[offset]" - exactly what atomic_try_cmpxchg()
does in a correct and more optimal way.

Fixes: a68c439b1966c ("apic, x86: Check if EILVT APIC registers are available (AMD only)")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230227160917.107820-1-ubizjak@gmail.com
2023-04-07 14:34:24 +02:00
Like Xu
649bccd7fa KVM: x86/pmu: Rewrite reprogram_counters() to improve performance
A valid pmc is always tested before using pmu->reprogram_pmi. Eliminate
this part of the redundancy by setting the counter's bitmask directly,
and in addition, trigger KVM_REQ_PMU only once to save more cpu cycles.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230214050757.9623-4-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 16:04:31 -07:00
Sean Christopherson
8bca8c5ce4 KVM: VMX: Refactor intel_pmu_{g,}set_msr() to align with other helpers
Invert the flows in intel_pmu_{g,s}et_msr()'s case statements so that
they follow the kernel's preferred style of:

        if (<not valid>)
                return <error>

        <commit change>
        return <success>

which is also the style used by every other {g,s}et_msr() helper (except
AMD's PMU variant, which doesn't use a switch statement).

Modify the "set" paths with costly side effects, i.e. that reprogram
counters, to skip only the side effects, i.e. to perform reserved bits
checks even if the value is unchanged.  None of the reserved bits checks
are expensive, so there's no strong justification for skipping them, and
guarding only the side effect makes it slightly more obvious what is being
skipped and why.

No functional change intended (assuming no reserved bit bugs).

Link: https://lkml.kernel.org/r/Y%2B6cfen%2FCpO3%2FdLO%40google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 16:03:20 -07:00
Like Xu
cdd2fbf636 KVM: x86/pmu: Rename pmc_is_enabled() to pmc_is_globally_enabled()
The name of function pmc_is_enabled() is a bit misleading. A PMC can
be disabled either by PERF_CLOBAL_CTRL or by its corresponding EVTSEL.
Append global semantics to its name.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230214050757.9623-2-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 15:57:19 -07:00
Sean Christopherson
957d0f70e9 KVM: x86/pmu: Zero out LBR capabilities during PMU refresh
Zero out the LBR capabilities during PMU refresh to avoid exposing LBRs
to the guest against userspace's wishes. If userspace modifies the
guest's CPUID model or invokes KVM_CAP_PMU_CAPABILITY to disable vPMU
after an initial KVM_SET_CPUID2, but before the first KVM_RUN, KVM will
retain the previous LBR info due to bailing before refreshing the LBR
descriptor.

Note, this is a very theoretical bug, there is no known use case where a
VMM would deliberately enable the vPMU via KVM_SET_CPUID2, and then later
disable the vPMU.

Link: https://lore.kernel.org/r/20230311004618.920745-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:58:43 -07:00
Sean Christopherson
3a6de51a43 KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run
Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
after running a vCPU, WARN and bug the VM if the PMU is refreshed after
the vCPU has run.

Note, KVM has disallowed CPUID updates after running a vCPU since commit
feb627e8d6f6 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e.
PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh
after KVM_RUN.

Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230311004618.920745-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:58:43 -07:00
Sean Christopherson
0094f62c7e KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN
Disallow writes to feature MSRs after KVM_RUN to prevent userspace from
changing the vCPU model after running the vCPU.  Similar to guest CPUID,
KVM uses feature MSRs to configure intercepts, determine what operations
are/aren't allowed, etc.  Changing the capabilities while the vCPU is
active will at best yield unpredictable guest behavior, and at worst
could be dangerous to KVM.

Allow writing the current value, e.g. so that userspace can blindly set
all MSRs when emulating RESET, and unconditionally allow writes to
MSR_IA32_UCODE_REV so that userspace can emulate patch loads.

Special case the VMX MSRs to keep the generic list small, i.e. so that
KVM can do a linear walk of the generic list without incurring meaningful
overhead.

Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:23 -07:00
Sean Christopherson
9eb6ba31db KVM: x86: Generate set of VMX feature MSRs using first/last definitions
Add VMX MSRs to the runtime list of feature MSRs by iterating over the
range of emulated MSRs instead of manually defining each MSR in the "all"
list.  Using the range definition reduces the cost of emulating a new VMX
MSR, e.g. prevents forgetting to add an MSR to the list.

Extracting the VMX MSRs from the "all" list, which is a compile-time
constant, also shrinks the list to the point where the compiler can
heavily optimize code that iterates over the list.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:23 -07:00
Sean Christopherson
5757f5b956 KVM: x86: Add macros to track first...last VMX feature MSRs
Add macros to track the range of VMX feature MSRs that are emulated by
KVM to reduce the maintenance cost of extending the set of emulated MSRs.

Note, KVM doesn't necessarily emulate all known/consumed VMX MSRs, e.g.
PROCBASED_CTLS3 is consumed by KVM to enable IPI virtualization, but is
not emulated as KVM doesn't emulate/virtualize IPI virtualization for
nested guests.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:22 -07:00
Sean Christopherson
fb3146b4dc KVM: x86: Add a helper to query whether or not a vCPU has ever run
Add a helper to query if a vCPU has run so that KVM doesn't have to open
code the check on last_vmentry_cpu being set to a magic value.

No functional change intended.

Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:22 -07:00
Sean Christopherson
b1932c5c19 KVM: x86: Rename kvm_init_msr_list() to clarify it inits multiple lists
Rename kvm_init_msr_list() to kvm_init_msr_lists() to clarify that it
initializes multiple lists: MSRs to save, emulated MSRs, and feature MSRs.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:22 -07:00
Basavaraj Natikar
f195fc1e97 x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot
The AMD [1022:15b8] USB controller loses some internal functional MSI-X
context when transitioning from D0 to D3hot. BIOS normally traps D0->D3hot
and D3hot->D0 transitions so it can save and restore that internal context,
but some firmware in the field can't do this because it fails to clear the
AMD_15B8_RCC_DEV2_EPF0_STRAP2 NO_SOFT_RESET bit.

Clear AMD_15B8_RCC_DEV2_EPF0_STRAP2 NO_SOFT_RESET bit before USB controller
initialization during boot.

Link: https://lore.kernel.org/linux-usb/Y%2Fz9GdHjPyF2rNG3@glanzmann.de/T/#u
Link: https://lore.kernel.org/r/20230329172859.699743-1-Basavaraj.Natikar@amd.com
Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
Tested-by: Thomas Glanzmann <thomas@glanzmann.de>
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
2023-04-06 16:23:59 -05:00
Kirill A. Shutemov
97740266de x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
arch_prctl(ARCH_FORCE_TAGGED_SVA) overrides the default and allows LAM
and SVA to co-exist in the process. It is expected by called by the
process when it knows what it is doing.

arch_prctl() operates on the current process, but the same code is
reachable from ptrace where it can be called on arbitrary task.

Make it strict and only allow to set MM_CONTEXT_FORCE_TAGGED_SVA for the
current process.

Fixes: 23e5d9ec2bab ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive")
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/all/20230403111020.3136-3-kirill.shutemov%40linux.intel.com
2023-04-06 13:45:06 -07:00
Kirill A. Shutemov
fca1fdd2b0 x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
Normally, LAM and SVA are mutually exclusive. LAM enabling will fail if
SVA is already in use.

Correct error code for the failure. EINTR is nonsensical there.

Fixes: 23e5d9ec2bab ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/all/CACT4Y+YfqSMsZArhh25TESmG-U4jO5Hjphz87wKSnTiaw2Wrfw@mail.gmail.com
Link: https://lore.kernel.org/all/20230403111020.3136-2-kirill.shutemov%40linux.intel.com
2023-04-06 13:44:58 -07:00
Sean Christopherson
400d213228 KVM: SVM: Return the local "r" variable from svm_set_msr()
Rename "r" to "ret" and actually return it from svm_set_msr() to reduce
the probability of repeating the mistake of commit 723d5fb0ffe4 ("kvm:
svm: Add IA32_FLUSH_CMD guest support"), which set "r" thinking that it
would be propagated to the caller.

Alternatively, the declaration of "r" could be moved into the handling of
MSR_TSC_AUX, but that risks variable shadowing in the future.  A wrapper
for kvm_set_user_return_msr() would allow eliding a local variable, but
that feels like delaying the inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:37 -04:00
Sean Christopherson
da3db168fb KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD
Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
if one of the many mitigations might require a flush in the guest, e.g.
Linux provides an option to flush the L1D when switching mms.

Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
and exposed to the guest, i.e. always let the guest write it directly if
FLUSH_L1D is fully supported.

Forward writes to hardware in host context on the off chance that KVM
ends up emulating a WRMSR, or in the really unlikely scenario where
userspace wants to force a flush.  Restrict these forwarded WRMSRs to
the known command out of an abundance of caution.  Passing through the
MSR means the guest can throw any and all values at hardware, but doing
so in host context is arguably a bit more dangerous.

Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:37 -04:00
Sean Christopherson
903358c7ed KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code
Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the
logic to kvm_set_msr_common().  Now that the MSR interception toggling is
handled as part of setting guest CPUID, the VMX and SVM paths are
identical.

Opportunistically massage the code to make it a wee bit denser.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
bff903e8cd KVM: SVM: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
9a4c485013 KVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
52887af565 KVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling
Revert the recently added virtualizing of MSR_IA32_FLUSH_CMD, as both
the VMX and SVM are fatally buggy to guests that use MSR_IA32_FLUSH_CMD or
MSR_IA32_PRED_CMD, and because the entire foundation of the logic is
flawed.

The most immediate problem is an inverted check on @cmd that results in
rejecting legal values.  SVM doubles down on bugs and drops the error,
i.e. silently breaks all guest mitigations based on the command MSRs.

The next issue is that neither VMX nor SVM was updated to mark
MSR_IA32_FLUSH_CMD as being a possible passthrough MSR,
which isn't hugely problematic, but does break MSR filtering and triggers
a WARN on VMX designed to catch this exact bug.

The foundational issues stem from the MSR_IA32_FLUSH_CMD code reusing
logic from MSR_IA32_PRED_CMD, which in turn was likely copied from KVM's
support for MSR_IA32_SPEC_CTRL.  The copy+paste from MSR_IA32_SPEC_CTRL
was misguided as MSR_IA32_PRED_CMD (and MSR_IA32_FLUSH_CMD) is a
write-only MSR, i.e. doesn't need the same "deferred passthrough"
shenanigans as MSR_IA32_SPEC_CTRL.

Revert all MSR_IA32_FLUSH_CMD enabling in one fell swoop so that there is
no point where KVM advertises, but does not support, L1D_FLUSH.

This reverts commits 45cf86f26148e549c5ba4a8ab32a390e4bde216e,
723d5fb0ffe4c02bd4edf47ea02c02e454719f28, and
a807b78ad04b2eaa348f52f5cc7702385b6de1ee.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/20230317190432.GA863767%40dev-arch.thelio-3990X
Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Message-Id: <20230322011440.2195485-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:35 -04:00
Suren Baghdasaryan
0bff0aaea0 x86/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Link: https://lkml.kernel.org/r/20230227173632.3292573-30-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:01 -07:00
Sean Christopherson
098f4c061e KVM: x86/pmu: Disallow legacy LBRs if architectural LBRs are available
Disallow enabling LBR support if the CPU supports architectural LBRs.
Traditional LBR support is absent on CPU models that have architectural
LBRs, and KVM doesn't yet support arch LBRs, i.e. KVM will pass through
non-existent MSRs if userspace enables LBRs for the guest.

Cc: stable@vger.kernel.org
Cc: Yang Weijiang <weijiang.yang@intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Tested-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230128001427.2548858-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 17:00:17 -07:00
Tom Rix
944a8dad8b KVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static
smatch reports
arch/x86/kvm/x86.c:199:20: warning: symbol
  'mitigate_smt_rsb' was not declared. Should it be static?

This variable is only used in one file so it should be static.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 16:47:35 -07:00
Like Xu
7e768ce827 KVM: x86/pmu: Zero out pmu->all_valid_pmc_idx each time it's refreshed
The kvm_pmu_refresh() may be called repeatedly (e.g. configure guest
CPUID repeatedly or update MSR_IA32_PERF_CAPABILITIES) and each
call will use the last pmu->all_valid_pmc_idx value, with the residual
bits introducing additional overhead later in the vPMU emulation.

Fixes: b35e5548b411 ("KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230404071759.75376-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 16:33:10 -07:00
Niklas Schnelle
fcbfe8121a
Kconfig: introduce HAS_IOPORT option and select it as necessary
We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O
Port access. In a future patch HAS_IOPORT=n will disable compilation of
the I/O accessor functions inb()/outb() and friends on architectures
which can not meaningfully support legacy I/O spaces such as s390.

The following architectures do not select HAS_IOPORT:

* ARC
* C-SKY
* Hexagon
* Nios II
* OpenRISC
* s390
* User-Mode Linux
* Xtensa

All other architectures select HAS_IOPORT at least conditionally.

The "depends on" relations on HAS_IOPORT in drivers as well as ifdefs
for HAS_IOPORT specific sections will be added in subsequent patches on
a per subsystem basis.

Co-developed-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net> # for ARCH=um
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-04-05 22:15:19 +02:00
Binbin Wu
548bd27428 KVM: VMX: Use is_64_bit_mode() to check 64-bit mode in SGX handler
sgx_get_encls_gva() uses is_long_mode() to check 64-bit mode, however,
SGX system leaf instructions are valid in compatibility mode, should
use is_64_bit_mode() instead.

Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230404032502.27798-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 11:52:19 -07:00
Tony Luck
36168bc061 x86/cpu: Add Xeon Emerald Rapids to list of CPUs that support PPIN
This should be the last addition to this table. Future CPUs will
enumerate PPIN support using CPUID.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230404212124.428118-1-tony.luck@intel.com
2023-04-05 20:01:52 +02:00
Paul E. McKenney
d276134ed4 arch/x86: Remove "select SRCU"
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it.  Therefore, remove the "select SRCU"
Kconfig statements.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:42 +00:00
Paul E. McKenney
79cf833be6 kvm: Remove "select SRCU"
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it.  Therefore, remove the "select SRCU"
Kconfig statements from the various KVM Kconfig files.

Acked-by: Sean Christopherson <seanjc@google.com> (x86)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <kvm@vger.kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org> (arm64)
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Anup Patel <anup@brainfault.org> (riscv)
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:42 +00:00
Tony Luck
81515ecf15 x86/cpu: Add model number for Intel Arrow Lake processor
Successor to Lunar Lake.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230404174641.426593-1-tony.luck@intel.com
2023-04-05 13:36:26 +02:00
Oliver Upton
e65733b5c5 KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.

Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Kirill A. Shutemov
5462ade687 x86/boot: Centralize __pa()/__va() definitions
Replace multiple __pa()/__va() definitions with a single one in misc.h.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230330114956.20342-2-kirill.shutemov%40linux.intel.com
2023-04-04 13:42:37 -07:00
Vipin Sharma
40fa907e5a KVM: x86/mmu: Merge all handle_changed_pte*() functions
Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a
single function, handle_changed_pte(), as the two are always used
together.  Remove the existing handle_changed_pte(), as it's just a
wrapper that calls __handle_changed_pte() and
handle_changed_spte_acc_track().

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:31 -07:00
Vipin Sharma
1f9973456e KVM: x86/mmu: Remove handle_changed_spte_dirty_log()
Remove handle_changed_spte_dirty_log() as there is no code flow which
sets 4KiB SPTE writable and hit this path. This function marks the page
dirty in a memslot only if new SPTE is 4KiB in size and writable.

Current users of handle_changed_spte_dirty_log() are:
1. set_spte_gfn() - Create only non writable SPTEs.
2. write_protect_gfn() - Change an SPTE to non writable.
3. zap leaf and roots APIs - Everything is 0.
4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE
5. tdp_mmu_link_sp() - Makes non leaf SPTEs.

There is also no path which creates a writable 4KiB without going
through make_spte() and this functions takes care of marking SPTE dirty
in the memslot if it is PT_WRITABLE.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: add blurb to __handle_changed_spte()'s comment]
Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
0b7cc2547d KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()
Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and
refactor the code. This variable is always set to true by its caller.

Remove single and double underscore prefix from tdp_mmu_set_spte()
related APIs:
1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte()
2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte()

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
891f115960 KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEs
Drop everything except the "tdp_mmu_spte_changed" tracepoint part of
__handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the
accessed status doesn't affect the SPTE's shadow-present status, whether
or not the SPTE is a leaf, or change the PFN.  I.e. none of the functional
updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured MMU or a locking bug elsewhere.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
6141df067d KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEs
Drop the unnecessary call to handle dirty log updates when aging TDP MMU
SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access
tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn
dirty in its memslot.  The access tracking path can _clear_ the Writable
bit, e.g. if the XCHG races with fast_page_fault() and writes the stale
value without the Writable bit set, but clearing the Writable bit outside
of mmu_lock is not allowed, i.e. access tracking can't spuriously set the
Writable bit.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
7ee131e3a3 KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEs
Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU
SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit.
Similar to the D-bit story, this will allow KVM to bypass
__handle_changed_spte() by ensuring only the A-bit is modified.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
e73008705d KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()
Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and
refactor the code as this variable is always set to true by its caller.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
1e0f42985f KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bits
Drop everything except marking the PFN dirty and the relevant tracepoint
parts of __handle_changed_spte() when clearing the dirty status of gfns in
the TDP MMU.  Clearing only the Dirty (or Writable) bit doesn't affect
the SPTEs shadow-present status, whether or not the SPTE is a leaf, or
change the SPTE's PFN.  I.e. other than marking the PFN dirty, none of the
functional updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured or a locking bug elsewhere.

Opportunistically remove a comment blurb from __handle_changed_spte()
about all modifications to TDP MMU SPTEs needing to invoke said function,
that "rule" hasn't been true since fast page fault support was added for
the TDP MMU (and perhaps even before).

Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of
clear dirty log stage improved by ~40% in dirty_log_perf_test (with the
full optimization applied).

Before optimization:
--------------------
Iteration 1 clear dirty log time: 3.638543593s
Iteration 2 clear dirty log time: 3.145032742s
Iteration 3 clear dirty log time: 3.142340358s
Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration)

After optimization:
-------------------
Iteration 1 clear dirty log time: 2.318988110s
Iteration 2 clear dirty log time: 1.794470164s
Iteration 3 clear dirty log time: 1.791668628s
Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration)

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
cf05e8c732 KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bits
Drop the unnecessary call to handle access-tracking changes when clearing
the dirty status of TDP MMU SPTEs.  Neither the Dirty bit nor the Writable
bit has any impact on the accessed state of a page, i.e. clearing only
the aforementioned bits doesn't make an accessed SPTE suddently not
accessed.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
89c313f20c KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flow
Optimize the clearing of dirty state in TDP MMU SPTEs by doing an
atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG
that currently ends up being invoked (see kvm_tdp_mmu_write_spte()).
Clearing _only_ the bit in question will allow KVM to skip the many
irrelevant checks in __handle_changed_spte() by avoiding any collateral
damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race
with fast_page_fault() setting the W-bit and the CPU setting the D-bit,
and thus incorrectly drop the CPU's D-bit update.

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
697c89bed9 KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMU
Deduplicate the guts of the TDP MMU's clearing of dirty status by
snapshotting whether to check+clear the Dirty bit vs. the Writable bit,
which is the only difference between the two flavors of dirty tracking.

Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e.
is constant after kvm-{intel,amd}.ko is loaded.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty log, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
5982a53926 KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot
Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in
the TDP MMU need to be write-protected when clearing accessed/dirty status
instead of manually checking every SPTE.  The per-SPTE A/D enabling is
specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is
not, and so cannot happen in the TDP MMU (which is non-nested only).

Keep the original code as sanity checks buried under MMU_WARN_ON().
MMU_WARN_ON() is more or less useless at the moment, but there are plans
to change that.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
41e07665f1 KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic write
Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should
be written atomically or not to a separate function.

This new function, kvm_tdp_mmu_spte_need_atomic_write(),  will be used
in future commits to optimize clearing bits in SPTEs.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Linus Torvalds
76f598ba7d PPC:
* Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled
 
 s390:
 
 * Fix handling of external interrupts in protected guests
 
 x86:
 
 * Resample the pending state of IOAPIC interrupts when unmasking them
 
 * Fix usage of Hyper-V "enlightened TLB" on AMD
 
 * Small fixes to real mode exceptions
 
 * Suppress pending MMIO write exits if emulator detects exception
 
 Documentation:
 
 * Fix rST syntax
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmQsXMQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc6Qf/YMDcYxpPuYFZ69t9k4s0ubeGjGpr
 Qdzo6ZgKbfs4+nu89u2F5zvAftRSrEVQRpfLwmTXauM8el3s+EAEPlK/MLH6sSSo
 LG3eWoNxs7MUQ6gBWeQypY6opvz3W9hf8KYd+GY6TV7jwV5/YgH6674Yyd7T/bCm
 0/tUiil5Q7Da0w4GUY5m9gEqlbA9yjDBFySqqTZemo18NzLaPNpesKnm1hvv0QJG
 JRvOMmHwqAZgS0G5cyGEPZKY6Ga0v9QsAXc8mFC/Jtn9GMD2f47PYDRfd/7NSiVz
 ulgsk8zZkZIZ6nMbQ1t10txmGIbFmJ4iPPaATrMII/xHq+idgAlHU7zY1A==
 =QvFj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "PPC:
   - Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled

  s390:
   - Fix handling of external interrupts in protected guests

  x86:
   - Resample the pending state of IOAPIC interrupts when unmasking them

   - Fix usage of Hyper-V "enlightened TLB" on AMD

   - Small fixes to real mode exceptions

   - Suppress pending MMIO write exits if emulator detects exception

  Documentation:
   - Fix rST syntax"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  docs: kvm: x86: Fix broken field list
  KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
  KVM: s390: pv: fix external interruption loop not always detected
  KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode
  KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection
  KVM: x86: Suppress pending MMIO write exits if emulator detects exception
  KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking
  KVM: irqfd: Make resampler_list an RCU list
  KVM: SVM: Flush Hyper-V TLB when required
2023-04-04 11:29:37 -07:00