IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Update the target pCPU for IOMMU doorbells when updating IRTE routing if
KVM is actively running the associated vCPU. KVM currently only updates
the pCPU when loading the vCPU (via avic_vcpu_load()), and so doorbell
events will be delayed until the vCPU goes through a put+load cycle (which
might very well "never" happen for the lifetime of the VM).
To avoid inserting a stale pCPU, e.g. due to racing between updating IRTE
routing and vCPU load/put, get the pCPU information from the vCPU's
Physical APIC ID table entry (a.k.a. avic_physical_id_cache in KVM) and
update the IRTE while holding ir_list_lock. Add comments with --verbose
enabled to explain exactly what is and isn't protected by ir_list_lock.
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Reported-by: dengqiao.joey <dengqiao.joey@bytedance.com>
Cc: stable@vger.kernel.org
Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20230808233132.2499764-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hoist the acquisition of ir_list_lock from avic_update_iommu_vcpu_affinity()
to its two callers, avic_vcpu_load() and avic_vcpu_put(), specifically to
encapsulate the write to the vCPU's entry in the AVIC Physical ID table.
This will allow a future fix to pull information from the Physical ID entry
when updating the IRTE, without potentially consuming stale information,
i.e. without racing with the vCPU being (un)loaded.
Add a comment to call out that ir_list_lock does NOT protect against
multiple writers, specifically that reading the Physical ID entry in
avic_vcpu_put() outside of the lock is safe.
To preserve some semblance of independence from ir_list_lock, keep the
READ_ONCE() in avic_vcpu_load() even though acuiring the spinlock
effectively ensures the load(s) will be generated after acquiring the
lock.
Cc: stable@vger.kernel.org
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20230808233132.2499764-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the recently introduced svm_get_lbr_vmcb() instead an open coded
equivalent to retrieve the target VMCB when emulating writes to
MSR_IA32_DEBUGCTLMSR.
No functional change intended.
Link: https://lore.kernel.org/r/20230607203519.1570167-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Clean up the enable_lbrv computation in svm_update_lbrv() to consolidate
the logic for computing enable_lbrv into a single statement, and to remove
the coding style violations (lack of curly braces on nested if).
No functional change intended.
Link: https://lore.kernel.org/r/20230607203519.1570167-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refactor KVM's handling of LBR MSRs on SVM to avoid a second layer of
case statements, and thus eliminate a dead KVM_BUG() call, which (a) will
never be hit in the current code base and (b) if a future commit breaks
things, will never fire as KVM passes "false" instead "true" or '1' for
the KVM_BUG() condition.
Reported-by: Michal Luczaj <mhal@rbox.co>
Cc: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230607203519.1570167-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Bail early from svm_enable_nmi_window() for SEV-ES guests without trying
to enable single-step of the guest, as single-stepping an SEV-ES guest is
impossible and the guest is responsible for *telling* KVM when it is ready
for an new NMI to be injected.
Functionally, setting TF and RF in svm->vmcb->save.rflags is benign as the
field is ignored by hardware, but it's all kinds of confusing.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-10-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Immediately mark NMIs as unmasked in response to #VMGEXIT(NMI complete)
instead of setting awaiting_iret_completion and waiting until the *next*
VM-Exit to unmask NMIs. The whole point of "NMI complete" is that the
guest is responsible for telling the hypervisor when it's safe to inject
an NMI, i.e. there's no need to wait. And because there's no IRET to
single-step, the next VM-Exit could be a long time coming, i.e. KVM could
incorrectly hold an NMI pending for far longer than what is required and
expected.
Opportunistically fix a stale reference to HF_IRET_MASK.
Fixes: 916b54a7688b ("KVM: x86: Move HF_NMI_MASK and HF_IRET_MASK into "struct vcpu_svm"")
Fixes: 4444dfe4050b ("KVM: SVM: Add NMI support for an SEV-ES guest")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-9-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable #DB for SEV-ES guests when DebugSwap is enabled. There is no point
in such intercept as KVM does not allow guest debug for SEV-ES guests.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-8-aik@amd.com
[sean: add comment as to why KVM disables #DB intercept iff DebugSwap=1]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for "DebugSwap for SEV-ES guests", which provides support
for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e.
allows KVM to expose debug capabilities to SEV-ES guests. Without
DebugSwap support, the CPU doesn't save/load most _guest_ debug
registers (except DR6/7), and KVM cannot manually context switch guest
DRs due the VMSA being encrypted.
Enable DebugSwap if and only if the CPU also supports NoNestedDataBp,
which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when
vectoring a #DB. Without NoNestedDataBp, a malicious guest can DoS
the host by putting the CPU into an infinite loop of vectoring #DBs
(see https://bugzilla.redhat.com/show_bug.cgi?id=1278496)
Set the features bit in sev_es_sync_vmsa() which is the last point
when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
init happens) is called not only when VCPU is initialised but also on
intrahost migration when VMSA is encrypted.
Eliminate DR7 intercepts as KVM can't modify guest DR7, and intercepting
DR7 would completely defeat the purpose of enabling DebugSwap.
Make X86_FEATURE_DEBUG_SWAP appear in /proc/cpuinfo (by not adding "") to
let the operator know if the VM can debug.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-7-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Currently SVM setup is done sequentially in
init_vmcb() -> sev_init_vmcb() -> sev_es_init_vmcb()
and tries keeping SVM/SEV/SEV-ES bits separated. One of the exceptions
is DR intercepts which is for SEV-ES before sev_es_init_vmcb() runs.
Move the SEV-ES intercept setup to sev_es_init_vmcb(). From now on
set_dr_intercepts()/clr_dr_intercepts() handle SVM/SEV only.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Santosh Shukla <santosh.shukla@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-6-aik@amd.com
[sean: drop comment about intercepting DR7]
Signed-off-by: Sean Christopherson <seanjc@google.com>
SVM/SEV enable debug registers intercepts to skip swapping DRs
on entering/exiting the guest. When the guest is in control of
debug registers (vcpu->guest_debug == 0), there is an optimisation to
reduce the number of context switches: intercepts are cleared and
the KVM_DEBUGREG_WONT_EXIT flag is set to tell KVM to do swapping
on guest enter/exit.
The same code also executes for SEV-ES, however it has no effect as
- it always takes (vcpu->guest_debug == 0) branch;
- KVM_DEBUGREG_WONT_EXIT is set but DR7 intercept is not cleared;
- vcpu_enter_guest() writes DRs but VMRUN for SEV-ES swaps them
with the values from _encrypted_ VMSA.
Be explicit about SEV-ES not supporting debug:
- return right away from dr_interception() and skip unnecessary processing;
- return an error right away from the KVM_SEV_LAUNCH_UPDATE_VMSA handler
if debugging was already enabled.
KVM_SET_GUEST_DEBUG are failing already after KVM_SEV_LAUNCH_UPDATE_VMSA
is finished due to vcpu->arch.guest_state_protected set to true.
Add WARN_ON to kvm_x86::sync_dirty_debug_regs() (saves guest DRs on
guest exit) to signify that SEV-ES won't hit that path.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-5-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rewrite the comment(s) in sev_es_prepare_switch_to_guest() to explain the
swap types employed by the CPU for SEV-ES guests, i.e. to explain why KVM
needs to save a seemingly random subset of host state, and to provide a
decoder for the APM's Type-A/B/C terminology.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-4-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Currently SVM setup is done sequentially in
init_vmcb() -> sev_init_vmcb() -> sev_es_init_vmcb() and tries
keeping SVM/SEV/SEV-ES bits separated. One of the exceptions
is #GP intercept which init_vmcb() skips setting for SEV guests and
then sev_es_init_vmcb() needlessly clears it.
Remove the SEV check from init_vmcb(). Clear the #GP intercept in
sev_init_vmcb(). SEV-ES will use the SEV setting.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-3-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Static functions set_dr_intercepts() and clr_dr_intercepts() are only
called from SVM so move them to .c.
No functional change intended.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20230615063757.3039121-2-aik@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
* Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the stage-2
fault path.
* Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with
services that live in the Secure world. pKVM intervenes on FF-A calls
to guarantee the host doesn't misuse memory donated to the hyp or a
pKVM guest.
* Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
* Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set configuration
from userspace, but the intent is to relax this limitation and allow
userspace to select a feature set consistent with the CPU.
* Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
* Use a separate set of pointer authentication keys for the hypervisor
when running in protected mode, as the host is untrusted at runtime.
* Ensure timer IRQs are consistently released in the init failure
paths.
* Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps
(FEAT_EVT), as it is a register commonly read from userspace.
* Erratum workaround for the upcoming AmpereOne part, which has broken
hardware A/D state management.
RISC-V:
* Redirect AMO load/store misaligned traps to KVM guest
* Trap-n-emulate AIA in-kernel irqchip for KVM guest
* Svnapot support for KVM Guest
s390:
* New uvdevice secret API
* CMM selftest and fixes
* fix racy access to target CPU for diag 9c
x86:
* Fix missing/incorrect #GP checks on ENCLS
* Use standard mmu_notifier hooks for handling APIC access page
* Drop now unnecessary TR/TSS load after VM-Exit on AMD
* Print more descriptive information about the status of SEV and SEV-ES during
module load
* Add a test for splitting and reconstituting hugepages during and after
dirty logging
* Add support for CPU pinning in demand paging test
* Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
* Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage
recovery threads (because nx_huge_pages=off can be toggled at runtime)
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
* Misc cleanups, fixes and comments
Generic:
* Miscellaneous bugfixes and cleanups
Selftests:
* Generate dependency files so that partial rebuilds work as expected
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmSgHrIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroORcAf+KkBlXwQMf+Q0Hy6Mfe0OtkKmh0Ae
6HJ6dsuMfOHhWv5kgukh+qvuGUGzHq+gpVKmZg2yP3h3cLHOLUAYMCDm+rjXyjsk
F4DbnJLfxq43Pe9PHRKFxxSecRcRYCNox0GD5UYL4PLKcH0FyfQrV+HVBK+GI8L3
FDzUcyJkR12Lcj1qf++7fsbzfOshL0AJPmidQCoc6wkLJpUEr/nYUqlI1Kx3YNuQ
LKmxFHS4l4/O/px3GKNDrLWDbrVlwciGIa3GZLS52PZdW3mAqT+cqcPcYK6SW71P
m1vE80VbNELX5q3YSRoOXtedoZ3Pk97LEmz/xQAsJ/jri0Z5Syk0Ok0m/Q==
=AMXp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the
stage-2 fault path.
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact
with services that live in the Secure world. pKVM intervenes on
FF-A calls to guarantee the host doesn't misuse memory donated to
the hyp or a pKVM guest.
- Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
- Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set
configuration from userspace, but the intent is to relax this
limitation and allow userspace to select a feature set consistent
with the CPU.
- Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
- Use a separate set of pointer authentication keys for the
hypervisor when running in protected mode, as the host is untrusted
at runtime.
- Ensure timer IRQs are consistently released in the init failure
paths.
- Avoid trapping CTR_EL0 on systems with Enhanced Virtualization
Traps (FEAT_EVT), as it is a register commonly read from userspace.
- Erratum workaround for the upcoming AmpereOne part, which has
broken hardware A/D state management.
RISC-V:
- Redirect AMO load/store misaligned traps to KVM guest
- Trap-n-emulate AIA in-kernel irqchip for KVM guest
- Svnapot support for KVM Guest
s390:
- New uvdevice secret API
- CMM selftest and fixes
- fix racy access to target CPU for diag 9c
x86:
- Fix missing/incorrect #GP checks on ENCLS
- Use standard mmu_notifier hooks for handling APIC access page
- Drop now unnecessary TR/TSS load after VM-Exit on AMD
- Print more descriptive information about the status of SEV and
SEV-ES during module load
- Add a test for splitting and reconstituting hugepages during and
after dirty logging
- Add support for CPU pinning in demand paging test
- Add support for AMD PerfMonV2, with a variety of cleanups and minor
fixes included along the way
- Add a "nx_huge_pages=never" option to effectively avoid creating NX
hugepage recovery threads (because nx_huge_pages=off can be toggled
at runtime)
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Add a maintainer's handbook to document KVM x86 processes,
preferred coding style, testing expectations, etc.
- Misc cleanups, fixes and comments
Generic:
- Miscellaneous bugfixes and cleanups
Selftests:
- Generate dependency files so that partial rebuilds work as
expected"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits)
Documentation/process: Add a maintainer handbook for KVM x86
Documentation/process: Add a label for the tip tree handbook's coding style
KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index
RISC-V: KVM: Remove unneeded semicolon
RISC-V: KVM: Allow Svnapot extension for Guest/VM
riscv: kvm: define vcpu_sbi_ext_pmu in header
RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC
RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel emulation of AIA APLIC
RISC-V: KVM: Implement device interface for AIA irqchip
RISC-V: KVM: Skeletal in-kernel AIA irqchip support
RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero
RISC-V: KVM: Add APLIC related defines
RISC-V: KVM: Add IMSIC related defines
RISC-V: KVM: Implement guest external interrupt line management
KVM: x86: Remove PRIx* definitions as they are solely for user space
s390/uv: Update query for secret-UVCs
s390/uv: replace scnprintf with sysfs_emit
s390/uvdevice: Add 'Lock Secret Store' UVC
...
- Drop manual TR/TSS load after VM-Exit now that KVM uses VMLOAD for host state
- Fix a not-yet-problematic missing call to trace_kvm_exit() for VM-Exits that
are handled in the fastpath
- Print more descriptive information about the status of SEV and SEV-ES during
module load
- Assert that misc_cg_set_capacity() doesn't fail to avoid should-be-impossible
memory leaks
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaK5ESHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5ApAQAJSlzFh6Cg6OzlKqsrXDTTrNuP7Yu5Pe
tCQm9uppab++TyBz00GCoaUjqXJY1j1riStbl0j1yGJ69Ocjrqj58IGGeoj4NKgt
Dsiwgs0IWCshe7noVcYQeC4FInrNiFOog7Zog7uDyJmtHprZHorcJ9rmsBXMmedw
OSrzoxyhVwbtbPmgMfEP+xw4wccXVioci4EOySqI0GI9QrQ+cfdafs8irxxeLG7v
IY1qG3fwNmGp2uHdb3lG48TUbggWzKG5o1RC+fwwN132Y2fepxjcAeZ25gNms3lz
Q1fm7vPNkGRqelqg7x+z9B10D6uJc0hngZPe6Hs8C7y1+hvTjXwmx81WXsQxM7RM
rhhbp1o1C0xKSLzFciaZyW4lQW4cw5wxGRNoIenpHUe48bK9wjTYxez2MiQwfbNJ
Dt9RAaBVF/UdNBZu2wtA3czgHwOHKSqUOwO2N2iBW62KgRzITQe9r9VtVikslbQD
/nAq7PJOGz8JuJXkDWI0nLYEW6pInzsiXB21CPQrYR8XOQnnWglzmMTL/KxPeVYg
pBHJUf6U7AdhjHMkPp2Yc1eQTNspDzRfZBGFZz1YS103JpmUIs97W0phHru/ONKh
1cBv2N9ZrOJhuL1LAxaLp9OSvR+UQP/mMdCAEjvUTpEbmqbtEAqWMkTTwIEdIi74
PcaJfv7GsYJV
=rcFn
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.5:
- Drop manual TR/TSS load after VM-Exit now that KVM uses VMLOAD for host state
- Fix a not-yet-problematic missing call to trace_kvm_exit() for VM-Exits that
are handled in the fastpath
- Print more descriptive information about the status of SEV and SEV-ES during
module load
- Assert that misc_cg_set_capacity() doesn't fail to avoid should-be-impossible
memory leaks
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaHFgSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5twMP/15ZJFqZVigVQoATJeeR9tWUuyJe95xM
lyfnTel91Sg8XOamdwBGi7jLpaDgj34Jm0cfM7/4LbJk2/taeaCLYmJd5w9FXvaw
EkytQGO85hVNe2XuY+h+XxSIxpflKxgFuUnOwcDk2QbKgASzNSG/mJ9ZBx8PNVXD
FnyOqpbbYDFspWWvUOAI/RkHnr/dALjXJsSUMvuh3nz5e1NTyubjCAZg+/bse2nR
s8FrcSh4B0Lg0h4r2fdJ4sAiM/qWhcCIhq5svyTAcUG0T4rMS40LrosJOw3wkBRM
dyZYXy6GEENeCFJPhenF1mTE1embFyZp89PV/FCNRZXODbnM4kheJFT9gucAjlKi
ZafRcutrkYIVf4lZCMofDfQGLX/GCEJnwUPKyGygIsPoDRrdR7OLrFycON5bxocr
9NBNG+2teQFbnt5irB/bBGojtIZtu3OEylkuRjQUQ3lJYQ5r6LddarI9acIu1SHt
4rRfh8QN5qmMvVblaQzggOr6BPtmPr8QqMEMFncaUMCsV/82hRAEfvj2rifGFJNo
Axz1ajMfirxyM45WzredUkzzsbphiiegPBELCLRZfHmaEhJ8P7t7wvri0bXt9YdI
vjSfX+6ulOgDC+xAazE0gEJO4Uh5+g3Y+1e0fr43ltWzUOWdCQskzD3LE9DkqIXj
KAaCuHYbYpIZ
=MwqV
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM x86/pmu changes for 6.5:
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
- Add back a comment about the subtle side effect of try_cmpxchg64() in
tdp_mmu_set_spte_atomic()
- Add an assertion in __kvm_mmu_invalidate_addr() to verify that the target
KVM MMU is the current MMU
- Add a "never" option to effectively avoid creating NX hugepage recovery
threads
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGrgSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KG0P/iFP2w0PAUQgIgbWeOWiIh1ZXq9JjjX+
GAjii1deMFzuEjOWTzWUY2FDE7Mzea5hsVoOmLY7kb9jwYwPJDhaKlaQbNLYgFAr
uJqGg8ZMRIbXGBhX98Z4qrUzqKwjgKDzswu/Fg6xZOhVLKNoIkV/YwVo3b1dZ8+e
ecctLJFtmV/xa7cFyOTnB9rDgUuBXc2jB7+7eza6+oFlhO/S1VB2XPBq+IT3KoPO
F40YQW05ortC8IaFHHkJSRTfVM8v+2WDzrwpJUtyalDAie4hhy08svCEX2cXEeYX
qWgjzPzQLM6AcFhb491M1BjFiEYuh5qhvETK+1PiIXTTq+xaIDb1HSM9BexkSVBR
scHt8RdamPq3noqZQgMEIzVHp5L3k72oy1iP0k2uIzirMW9v+M3MWLsQuDV+CaLU
+EYQozWNEcDO7b/gpYsGWG9Me11GibqIJeyLJFU7HwAmACBiRyy6RD+RS7NMGzHB
9HT6TkSbPc1+cLJ5npCFwZBkj+vwjPs3lEjVQkiGZtavt1nWHfE8ASdv+hNwnJg/
Xz+PVdKh6g0A3mUqxz/ZuDTp3Hfz3jL1roYFGIAUXAjaebih0MUc/CYf84VvoqIq
IymI37EoK9CnMszQGJBc2IeB+Bc8KptYCs+M0WYNQ7MPcLIJHKpIzFPosRb2qyOj
/GEYFjfFwaPR
=FM8H
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM x86/mmu changes for 6.5:
- Add back a comment about the subtle side effect of try_cmpxchg64() in
tdp_mmu_set_spte_atomic()
- Add an assertion in __kvm_mmu_invalidate_addr() to verify that the target
KVM MMU is the current MMU
- Add a "never" option to effectively avoid creating NX hugepage recovery
threads
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Fix a longstanding bug in the reporting of the number of entries returned by
KVM_GET_CPUID2
- Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGMMSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5iDIP/0PwY3J5odTEUTnAyuDFPimd5PBt9k/O
B414wdpSKVgzq+0An4qM9mKRnklVIh2p8QqQTvDhcBUg3xb6CX9xZ4ery7hp/T5O
tr5bAXs2AYX6jpxvsopt+w+E9j6fvkJhcJCRU9im3QbrqwUE+ecyU5OHvmv2n/GO
syVZJbPOYuoLPKDjlSMrScE6fWEl9UOvHc5BK/vafTeyisMG3vv1BSmJj6GuiNNk
TS1RRIg//cOZghQyDfdXt0azTmakNZyNn35xnoX9x8SRmdRykyUjQeHmeqWxPDso
kiGO+CGancfS57S6ZtCkJjqEWZ1o/zKdOxr8MMf/3nJhv4kY7/5XtlVoACv5soW9
bZEmNiXIaSbvKNMwAlLJxHFbLa1sMdSCb345CIuMdt5QiWJ53ZiTyIAJX6+eL+Zf
8nkeekgPf5VUs6Zt0RdRPyvo+W7Vp9BtI87yDXm1nQKpbys2pt6CD3YB/oF4QViG
a5cyGoFuqRQbS3nmbshIlR7EanTuxbhLZKrNrFnolZ5e624h3Cnk2hVsfTznVGiX
vNHWM80phk1CWB9McErrZVkGfjlyVyBL13CBB2XF7Dl6PfF6/N22a9bOuTJD3tvk
PlNx4hvZm3esvvyGpjfbSajTKYE8O7rxiE1KrF0BpZ5IUl5WSiTr6XCy/yI/mIeM
hay2IWhPOF2z
=D0BH
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.5:
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
* Misc cleanups
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
The cmpxchg128() family of functions is basically & functionally
the same as cmpxchg_double(), but with a saner interface: instead
of a 6-parameter horror that forced u128 - u64/u64-halves layout
details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128 types.
- Restructure the generated atomic headers, and add
kerneldoc comments for all of the generic atomic{,64,_long}_t
operations. Generated definitions are much cleaner now,
and come with documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering
when taking multiple locks of the same type. This gets rid of
one use of lockdep_set_novalidate_class() in the bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
variable shadowing generating garbage code on Clang on certain
ARM builds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
x9Nt+Tp0Ze4=
=DsYj
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
- Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of higher-frequency
SMT cores and lower-frequency non-SMT cores), under the old code
lower-priority CPUs pulled tasks from the higher-priority cores if
more than one SMT sibling was busy - resulting in many unnecessary
task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores with more
than one busy sibling and allows lower-priority CPUs to pull tasks, which
avoids superfluous migrations and lets lower-priority cores inspect all SMT
siblings for the busiest queue.
- Implement the 'runnable boosting' feature in the EAS balancer: consider CPU
contention in frequency, EAS max util & load-balance busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves other key
workloads unchanged.
- Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it
into the build_sched_topology() helper function and building
it dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
- Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations.
- Fix task_struct::saved_state handling.
- Fix various rq clock update bugs, unearthed by turning on the rq clock
debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by
creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
- Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation
to (maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS
qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w
egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk
o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo
9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR
u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf
vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl
AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T
wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr
4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF
VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N
PiDbtX760ic=
=EWQA
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of
higher-frequency SMT cores and lower-frequency non-SMT cores),
under the old code lower-priority CPUs pulled tasks from the
higher-priority cores if more than one SMT sibling was busy -
resulting in many unnecessary task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores
with more than one busy sibling and allows lower-priority CPUs
to pull tasks, which avoids superfluous migrations and lets
lower-priority cores inspect all SMT siblings for the busiest
queue.
- Implement the 'runnable boosting' feature in the EAS balancer:
consider CPU contention in frequency, EAS max util & load-balance
busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves
other key workloads unchanged.
Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it into
the build_sched_topology() helper function and building it
dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations
- Fix task_struct::saved_state handling
- Fix various rq clock update bugs, unearthed by turning on the rq
clock debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
by creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or
CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation to
(maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings"
* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
sched/core: Fixed missing rq clock update before calling set_rq_offline()
sched/deadline: Update GRUB description in the documentation
sched/deadline: Fix bandwidth reclaim equation in GRUB
sched/wait: Fix a kthread_park race with wait_woken()
sched/topology: Mark set_sched_topology() __init
sched/fair: Rename variable cpu_util eff_util
arm64/arch_timer: Fix MMIO byteswap
sched/fair, cpufreq: Introduce 'runnable boosting'
sched/fair: Refactor CPU utilization functions
cpuidle: Use local_clock_noinstr()
sched/clock: Provide local_clock_noinstr()
x86/tsc: Provide sched_clock_noinstr()
clocksource: hyper-v: Provide noinstr sched_clock()
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
x86/vdso: Fix gettimeofday masking
math64: Always inline u128 version of mul_u64_u64_shr()
s390/time: Provide sched_clock_noinstr()
loongarch: Provide noinstr sched_clock_read()
...
WARN and continue if misc_cg_set_capacity() fails, as the only scenario
in which it can fail is if the specified resource is invalid, which should
never happen when CONFIG_KVM_AMD_SEV=y. Deliberately not bailing "fixes"
a theoretical bug where KVM would leak the ASID bitmaps on failure, which
again can't happen.
If the impossible should happen, the end result is effectively the same
with respect to SEV and SEV-ES (they are unusable), while continuing on
has the advantage of letting KVM load, i.e. userspace can still run
non-SEV guests.
Reported-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/r/20230607004449.1421131-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a "never" option to the nx_huge_pages module param to allow userspace
to do a one-way hard disabling of the mitigation, and don't create the
per-VM recovery threads when the mitigation is hard disabled. Letting
userspace pinky swear that userspace doesn't want to enable NX mitigation
(without reloading KVM) allows certain use cases to avoid the latency
problems associated with spawning a kthread for each VM.
E.g. in FaaS use cases, the guest kernel is trusted and the host may
create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
a burst of VMs is created.
Reported-by: Li RongQing <lirongqing@baidu.com>
Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com
Cc: Yong He <zhuangel570@gmail.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Luiz Capitulino <luizcap@amazon.com>
Reviewed-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refresh comments about msrs_to_save, emulated_msrs, and msr_based_features
to remove stale references left behind by commit 2374b7310b66 (KVM:
x86/pmu: Use separate array for defining "PMU MSRs to save"), and to
better reflect the current reality, e.g. emulated_msrs is no longer just
for MSRs that are "kvm-specific".
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230607004636.1421424-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new
performance monitoring features for AMD processors.
Bit 0 of EAX indicates support for Performance Monitoring Version 2
(PerfMonV2) features. If found to be set during PMU initialization,
the EBX bits of the same CPUID function can be used to determine
the number of available PMCs for different PMU types.
Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that
guests can make use of the PerfMonV2 features.
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by
the guest, it can use a new scheme to manage the Core PMCs using the
new global control and status registers.
In addition to benefiting from the PerfMonV2 functionality in the same
way as the host (higher precision), the guest also can reduce the number
of vm-exits by lowering the total number of MSRs accesses.
In terms of implementation details, amd_is_valid_msr() is resurrected
since three newly added MSRs could not be mapped to one vPMC.
The possibility of emulating PerfMonV2 on the mainframe has also
been eliminated for reasons of precision.
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: drop "Based on the observed HW." comments]
Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a KVM-only leaf for AMD's PerfMonV2 to redirect the kernel's scattered
version to its architectural location, e.g. so that KVM can query guest
support via guest_cpuid_has().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Cap the number of general purpose counters enumerated on AMD to what KVM
actually supports, i.e. don't allow userspace to coerce KVM into thinking
there are more counters than actually exist, e.g. by enumerating
X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enable and advertise PERFCTR_CORE if and only if the minimum number of
required counters are available, i.e. if perf says there are less than six
general purpose counters.
Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding
the check for host support.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: massage shortlog and changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:
The legacy architecture defines four performance counters (PerfCtrn)
and corresponding event-select registers (PerfEvtSeln).
Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available. Rather than bleed AMD
details into the common code, just define a const unsigned int and
provide a convenient location to document why Intel and AMD have different
mins (in particular, AMD's lack of any way to enumerate less than four
counters to the guest).
Keep the minimum number of counters at Intel at one, even though old P6
and Core Solo/Duo processor effectively require a minimum of two counters.
KVM can, and more importantly has up until this point, supported a vPMU so
long as the CPU has at least one counter. Perf's support for P6/Core CPUs
does require two counters, but perf will happily chug along with a single
counter when running on a modern CPU.
Cc: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: set Intel min to '1', not '2']
Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an explicit !enable_pmu check as relying on kvm_pmu_cap to be
zeroed isn't obvious. Although when !enable_pmu, KVM will have
zero-padded kvm_pmu_cap to do subsequent CPUID leaf assignments.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the Intel PMU implementation of pmc_is_enabled() to common x86 code
as pmc_is_globally_enabled(), and drop AMD's implementation. AMD PMU
currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the
semantics for AMD are unchanged. And when support for AMD PMU v2 comes
along, the common behavior will also Just Work.
Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL,
a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code.
AMD PerfMonV2 defines three registers that have the same semantics as
Intel's variants, just with different names and indices. Conveniently,
since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and
AMD's version shows up in v2, KVM can use common code for the existence
check as well.
Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set
reserved bits. Allowing userspace to stuff reserved bits doesn't harm KVM
itself, but it's architecturally wrong and the guest can't clear the
unsupported bits, e.g. makes the guest's PMI handler very confused.
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: rewrite changelog to avoid use of #GP, rebase on name change]
Link: https://lore.kernel.org/r/20230603011058.1038821-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move reprogram_counters() out of Intel specific PMU code and into pmu.h so
that it can be used to implement AMD PMU v2 support.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: rewrite changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now
that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4.
GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are
just different names for the same thing, but the SDM provides different
entries in the IA-32 Architectural MSRs table, which gets really confusing
when looking at PMU v4 definitions since it *looks* like GLOBAL_STATUS has
bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are
simply defined in the GLOBAL_STATUS_RESET entry.
No functional change intended.
Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
As test_bit() returns bool, explicitly converting result to bool is
unnecessary. Get rid of '!!'.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace an #ifdef on CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS with a
cpu_feature_enabled() check on X86_FEATURE_PKU. The macro magic of
DISABLED_MASK_BIT_SET() means that cpu_feature_enabled() provides the
same end result (no code generated) when PKU is disabled by Kconfig.
No functional change intended.
Cc: Jon Kohler <jon@nutanix.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20230602010550.785722-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Request an APIC-access page reload when the backing page is migrated (or
unmapped) if and only if vendor code actually plugs the backing pfn into
structures that reside outside of KVM's MMU. This avoids kicking all
vCPUs in the (hopefully infrequent) scenario where the backing page is
migrated/invalidated.
Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into
the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping".
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that KVM honors past and in-progress mmu_notifier invalidations when
reloading the APIC-access page, use KVM's "standard" invalidation hooks
to trigger a reload and delete the one-off usage of invalidate_range().
Aside from eliminating one-off code in KVM, dropping KVM's use of
invalidate_range() will allow common mmu_notifier to redefine the API to
be more strictly focused on invalidating secondary TLBs that share the
primary MMU's page tables.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Re-request an APIC-access page reload if there is a relevant mmu_notifier
invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
vCPUs until the backing pfn for the APIC-access page is "officially"
stable. Relying on the primary MMU to not make changes after invoking
->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
would also trigger an ->invalidate_range(), but using ->invalidate_range()
to fudge around KVM not honoring past and in-progress invalidations is a
bit hacky.
Honoring invalidations will allow using KVM's standard mmu_notifier hooks
to detect APIC-access page reloads, which will in turn allow removing
KVM's implementation of ->invalidate_range() (the APIC-access page case is
a true one-off).
Opportunistically add a comment to explain why doing nothing if a memslot
isn't found is functionally correct.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Let's print available ASID ranges for SEV/SEV-ES guests.
This information can be useful for system administrator
to debug if SEV/SEV-ES fails to enable.
There are a few reasons.
SEV:
- NPT is disabled (module parameter)
- CPU lacks some features (sev, decodeassists)
- Maximum SEV ASID is 0
SEV-ES:
- mmio_caching is disabled (module parameter)
- CPU lacks sev_es feature
- Minimum SEV ASID value is 1 (can be adjusted in BIOS/UEFI)
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/r/20230522161249.800829-3-aleksandr.mikhalitsyn@canonical.com
[sean: print '0' for min SEV-ES ASID if there are no available ASIDs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
There is no VMENTER_L1D_FLUSH_NESTED_VM. It should be
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230524061634.54141-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of
U64_MAX as an error return. This breaks the clean wrap-around of the
clock.
Modify the function signature to return a boolean state and provide
another u64 pointer to store the actual time on success. This obviates
the need to steal one time value and restores the full counter width.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org
Now that we have raw_atomic*_<op>() definitions, there's no need to use
arch_atomic*_<op>() definitions outside of the low-level atomic
definitions.
Move treewide users of arch_atomic*_<op>() over to the equivalent
raw_atomic*_<op>().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com
Bail from kvm_recalculate_phys_map() and disable the optimized map if the
target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
and/or enabled its local APIC after the map was allocated. This fixes an
out-of-bounds access bug in the !x2apic_format path where KVM would write
beyond the end of phys_map.
Check the x2APIC ID regardless of whether or not x2APIC is enabled,
as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and
the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC
being enabled, i.e. the check won't get false postivies.
Note, this also affects the x2apic_format path, which previously just
ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a
bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
the vCPU until the next map recalculation. In practice, that "bug" is
likely benign as a newly present vCPU/APIC would immediately trigger a
recalc. But, there's no functional downside to disabling the map, and
a future patch will gracefully handle the -E2BIG case by retrying instead
of simply disabling the optimized map.
Opportunistically add a sanity check on the xAPIC ID size, along with a
comment explaining why the xAPIC ID is guaranteed to be "good".
Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 5b84b0291702 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move SVM's call to trace_kvm_exit() from the "slow" VM-Exit handler to
svm_vcpu_run() so that KVM traces fastpath VM-Exits that re-enter the
guest without bouncing through the slow path. This bug is benign in the
current code base as KVM doesn't currently support any such exits on SVM.
Fixes: a9ab13ff6e84 ("KVM: X86: Improve latency for single target IPI fastpath")
Link: https://lore.kernel.org/r/20230602011920.787844-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.
Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>