When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.
Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.
At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a helper to retrieve the highest pending vector given a Posted
Interrupt descriptor. While the actual operation is straightforward, it's
surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's
find_highest_vector(), which doesn't work with PID.PIR due to the APIC's
IRR and ISR component registers being physically discontiguous (they're
4-byte registers aligned at 16-byte intervals).
To make PIR handling more consistent with respect to IRR and ISR handling,
return -1 to indicate "no interrupt pending".
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
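As an illustration of the helper described above, here is a minimal userspace
sketch, not KVM's actual code. It assumes the PIR is a contiguous 256-bit
bitmap held in four 64-bit words, and models the APIC-style layout as eight
4-byte registers spaced 16 bytes apart on a little-endian host. The names
pir_highest_vector() and apic_highest_vector() are hypothetical, and
__builtin_clzll()/__builtin_clz() are GCC/Clang builtins.

```c
#include <stdint.h>
#include <string.h>

#define PIR_WORDS   4     /* 256 vectors / 64 bits, contiguous (assumed) */
#define APIC_REGS   8     /* 256 vectors / 32 bits */
#define APIC_STRIDE 0x10  /* each 4-byte register sits on a 16-byte boundary */

/* Highest pending vector in a contiguous PIR, or -1 if nothing is pending. */
static int pir_highest_vector(const uint64_t pir[PIR_WORDS])
{
	for (int i = PIR_WORDS - 1; i >= 0; i--) {
		if (pir[i])
			return i * 64 + 63 - __builtin_clzll(pir[i]);
	}
	return -1;
}

/* Same search over the discontiguous APIC-style layout, or -1 if empty. */
static int apic_highest_vector(const uint8_t *regs)
{
	for (int i = APIC_REGS - 1; i >= 0; i--) {
		uint32_t word;

		/* Only 4 of every 16 bytes carry vector bits. */
		memcpy(&word, regs + i * APIC_STRIDE, sizeof(word));
		if (word)
			return i * 32 + 31 - __builtin_clz(word);
	}
	return -1;
}
```

The two loops differ only in stride and word size, which is why a scan
written for one layout silently returns wrong answers on the other.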
In the vmx_init() error handling path, __vmx_exit() is called before
kvm_x86_vendor_exit(). vmx_exit() should follow the same order, but it
currently calls __vmx_exit() after kvm_x86_vendor_exit(). Swap the two
calls in vmx_exit() to fix the ordering.
Fixes: e32b120071ea ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240627010524.3732488-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove the completely pointless global INVEPT, i.e. EPT TLB flush, from
KVM's VMX enablement path. KVM always does a targeted TLB flush when
using a "new" EPT root, in quotes because "new" simply means a root that
isn't currently being used by the vCPU.
KVM also _deliberately_ runs with stale TLB entries for defunct roots,
i.e. doesn't do a TLB flush when vCPUs stop using roots, precisely because
KVM does the flush on first use. As called out by the comment in
kvm_mmu_load(), the reason KVM flushes on first use is because KVM can't
guarantee the correctness of past hypervisors.
Jumping back to the global INVEPT, when the painfully terse commit
1439442c7b25 ("KVM: VMX: Enable EPT feature for KVM") was added, the
effective TLB flush being performed was:
  static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
  {
          vpid_sync_vcpu_all(to_vmx(vcpu));
  }
I.e. KVM was not flushing EPT TLB entries when allocating a "new" root,
which very strongly suggests that the global INVEPT during hardware
enabling was a misguided hack that addressed the most obvious symptom,
but failed to fix the underlying bug.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240608001003.3296640-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rewrite the comment above VMCS12_REVISION to unequivocally state that the
ID must never change. KVM_{G,S}ET_NESTED_STATE have been officially
supported for some time now, i.e. changing VMCS12_REVISION would break
userspace.
Opportunistically add a blurb to the CHECK_OFFSET() comment to make it
explicitly clear that new fields are allowed, i.e. that the restriction
on the layout is all about backwards compatibility.
No functional change intended.
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240613190103.1054877-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.
The prior intent of the per-architecture limits was:
arm64: capped at 0x1FF (9 bits), 5 bits effective
powerpc: uncapped (10 bits), 6 or 7 bits effective
riscv: uncapped (10 bits), 6 bits effective
x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
s390: capped at 0xFF (8 bits), undocumented effective entropy
Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."
Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com>
Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
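The arithmetic behind the default cap above can be sketched as follows. This
is a minimal userspace model, not the kernel's macro: kstack_offset_cap() and
mask_bits() are hypothetical names, and only the 0x3FF, 0x1FF and 0xFF masks
come from the commit message.

```c
#include <stdint.h>

#define KSTACK_OFFSET_MASK 0x3FFu  /* 10 bits, the default named above */

/* Cap a raw random value to the range the default mask allows. */
static uint32_t kstack_offset_cap(uint32_t rnd)
{
	return rnd & KSTACK_OFFSET_MASK;
}

/* Number of bits set in a contiguous low mask such as 0xFF or 0x3FF. */
static unsigned int mask_bits(uint32_t mask)
{
	unsigned int bits = 0;

	while (mask) {
		bits += mask & 1;
		mask >>= 1;
	}
	return bits;
}
```

With the 10-bit mask the largest possible offset is 1023 bytes, which is
where the "at most 1KiB of stack space" figure comes from.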
Use sev_es_host_save_area() instead of open coding an equivalent when
setting the MSR_TSC_AUX field during setup.
No functional change intended.
Link: https://lore.kernel.org/r/20240617210432.1642542-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Force sev_es_host_save_area() to be always inlined, as it's used in the
low level VM-Enter/VM-Exit path, which is non-instrumentable.
vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xb0: call to
sev_es_host_save_area() leaves .noinstr.text section
vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xbf: call to
sev_es_host_save_area.isra.0() leaves .noinstr.text section
Fixes: c92be2fd8edf ("KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area")
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240617210432.1642542-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add module descriptions for the vendor modules to fix allmodconfig
'make W=1' warnings:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com
[sean: split kvm.ko change to separate commit]
Signed-off-by: Sean Christopherson <seanjc@google.com>
In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable. Add that as an option.
This is similar to the old spectre_bhi=auto option which was removed
with the following commit:
36d4fe147c87 ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")
with the main difference being that this has a more descriptive name and
is disabled by default.
Mitigation switch requested by Maksim Davydov <davydov-max@yandex-team.ru>.
[ bp: Massage. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/2cbad706a6d5e1da2829e5e123d8d5c80330148c.1719381528.git.jpoimboe@kernel.org
The direct-call syscall dispatch function doesn't know that the exit()
and exit_group() syscall handlers don't return, so the call sites aren't
optimized accordingly.
Fix that by marking the exit syscall declarations __noreturn.
Fixes the following warnings:
vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
Fixes: 1e3ad78334a6 ("x86/syscall: Don't force use of indirect calls for system calls")
Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/5d8882bc077d8eadcc7fd1740b56dfb781f12288.1719381528.git.jpoimboe@kernel.org
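The effect of the annotation described above can be sketched in userspace.
This is only a model under stated assumptions: the syscall numbers, the
dispatch() function and the handler names are hypothetical, and the GCC/Clang
noreturn attribute stands in for the kernel's __noreturn.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical syscall numbers for this model only. */
enum { NR_GETVAL = 0, NR_EXIT = 1 };

static long sys_getval_model(void)
{
	return 42;
}

/* The attribute tells the compiler that control never comes back. */
static __attribute__((noreturn)) void sys_exit_model(int code)
{
	exit(code);
}

/* Direct-call dispatch: a switch instead of an indirect table call. */
static long dispatch(int nr, int arg)
{
	switch (nr) {
	case NR_GETVAL:
		return sys_getval_model();
	case NR_EXIT:
		/* No return or break is needed after a noreturn call. */
		sys_exit_model(arg);
	default:
		return -ENOSYS;
	}
}
```

Because the declaration carries the attribute, the compiler and tools that
check control flow agree that nothing follows the exit call.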
Merge tag 'amd-pstate-v6.11-2024-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge more amd-pstate driver updates for 6.11 from Mario Limonciello:
"Add support for amd-pstate core performance boost support which
allows controlling which CPU cores can operate above nominal
frequencies for short periods of time."
* tag 'amd-pstate-v6.11-2024-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
Documentation: cpufreq: amd-pstate: update doc for Per CPU boost control method
cpufreq: amd-pstate: Cap the CPPC.max_perf to nominal_perf if CPB is off
cpufreq: amd-pstate: initialize core precision boost state
cpufreq: acpi: move MSR_K7_HWCR_CPB_DIS_BIT into msr-index.h
VMware hypercalls use I/O port, VMCALL or VMMCALL instructions. Add a call to
__tdx_hypercall() in order to support TDX guests.
No change in high bandwidth hypercalls, as only low bandwidth ones are supported
for TDX guests.
[ bp: Massage, clear on-stack struct tdx_module_args variable. ]
Co-developed-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-9-alexey.makhalov@broadcom.com
No more direct use of these macros should be allowed. The vmware_hypercallX API
still uses the new implementation of the VMWARE_HYPERCALL macro internally, but
it is not exposed outside of vmware.h.
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-8-alexey.makhalov@broadcom.com
VCPU_RESERVED and LEGACY_X2APIC are not VMware hypercall commands. They are
bits in the return value of the VMWARE_CMD_GETVCPU_INFO command. Change the
VMWARE_CMD_ prefix to GETVCPU_INFO_ and move the bit-shift operation into
the macro body.
Fixes: 4cca6ea04d31c ("x86/apic: Allow x2apic without IR on VMware platform")
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-7-alexey.makhalov@broadcom.com
Remove VMWARE_CMD macro and move to vmware_hypercall API.
No functional changes intended.
Use u32/u64 instead of uint32_t/uint64_t across the file.
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-6-alexey.makhalov@broadcom.com
Introduce a vmware_hypercall family of functions. It is a common implementation
to be used by the VMware guest code and virtual device drivers in an
architecture-independent manner.
The API consists of vmware_hypercallX and vmware_hypercall_hb_{out,in}
set of functions analogous to KVM's hypercall API. Architecture-specific
implementation is hidden inside.
It will simplify future enhancements in VMware hypercalls, such as SEV-ES and
TDX related changes, without needing to modify callers in device driver code.
Current implementation extends an idea from
bac7b4e84323 ("x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls")
to have a slow, but safe path vmware_hypercall_slow() earlier during the boot
when alternatives are not yet applied. The code inherits VMWARE_CMD logic from
the commit mentioned above.
Move common macros from vmware.c to vmware.h.
[ bp: Fold in a fix:
https://lore.kernel.org/r/20240625083348.2299-1-alexey.makhalov@broadcom.com ]
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-2-alexey.makhalov@broadcom.com
Using sys_io_pgetevents() as the entry point for compat mode tasks
works almost correctly, but misses the sign extension for the min_nr
and nr arguments.
This was addressed on parisc by switching to
compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc:
io_pgetevents_time64() needs compat syscall in 32-bit compat mode"),
as well as by using more sophisticated system call wrappers on x86 and
s390. However, arm64, mips, powerpc, sparc and riscv still have the
same bug.
Change all of them over to use compat_sys_io_pgetevents_time64()
like parisc already does. This was clearly the intention when the
function was originally added, but it got hooked up incorrectly in
the tables.
Cc: stable@vger.kernel.org
Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
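The missing sign extension described above can be shown with a small userspace
model. It assumes that a 32-bit task's register arrives zero-filled in the
upper half, and that converting an out-of-range value to int32_t wraps, as it
does with GCC and Clang. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * What a handler taking 64-bit long arguments sees when the 32-bit
 * register value is simply zero-filled in the upper half.
 */
static int64_t as_native_arg(uint32_t reg)
{
	return (int64_t)(uint64_t)reg;
}

/* What a compat entry point does: treat the value as signed 32-bit first. */
static int64_t as_compat_arg(uint32_t reg)
{
	return (int64_t)(int32_t)reg;
}
```

A 32-bit caller passing -1 for min_nr or nr is seen as 4294967295 through the
native entry point and as -1 through the compat one. Non-negative values are
identical on both paths, which is why the bug is easy to miss.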
When called with a 'from' that is not 4-byte-aligned, string_memcpy_fromio()
calls the movs() macro to copy the first few bytes, so that 'from' becomes
4-byte-aligned before calling rep_movs(). This movs() macro modifies 'to', and
the subsequent line modifies 'n'.
As a result, on unaligned accesses, kmsan_unpoison_memory() uses the updated
(aligned) values of 'to' and 'n'. Hence, it does not unpoison the entire
region.
Save the original values of 'to' and 'n', and pass those to
kmsan_unpoison_memory(), so that the entire region is unpoisoned.
Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20240523215029.4160518-1-bjohannesmeyer@gmail.com
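The bug and the fix described above can be modelled in userspace. This is a
sketch, not the kernel function: model_memcpy_fromio(), model_unpoison() and
demo_unpoisoned_len() are hypothetical names, memcpy() stands in for
rep_movs(), and the stand-in unpoison call only records the length it was
given.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of the last region passed to the stand-in unpoison call. */
static size_t last_unpoisoned_len;

static void model_unpoison(const void *addr, size_t len)
{
	(void)addr;
	last_unpoisoned_len = len;
}

static void model_memcpy_fromio(void *to, const void *from, size_t n, int fixed)
{
	unsigned char *dst = to;
	const unsigned char *src = from;
	void *orig_to = to;	/* saved before the head copy changes them */
	size_t orig_n = n;

	/* Head: copy single bytes until the source is 4-byte aligned. */
	while (n && ((uintptr_t)src & 3)) {
		*dst++ = *src++;
		n--;
	}
	memcpy(dst, src, n);	/* stands in for the bulk rep_movs() copy */

	if (fixed)
		model_unpoison(orig_to, orig_n);	/* whole destination */
	else
		model_unpoison(dst, n);			/* aligned tail only */
}

/* Copy n bytes from a source misaligned by 'misalign'; misalign + n <= 64. */
static size_t demo_unpoisoned_len(size_t misalign, size_t n, int fixed)
{
	static uint32_t io_words[16];	/* 4-byte aligned backing store */
	unsigned char out[64];

	model_memcpy_fromio(out, (unsigned char *)io_words + misalign, n, fixed);
	return last_unpoisoned_len;
}
```

With a source misaligned by one byte and n = 8, the head loop consumes three
bytes, so the unfixed variant unpoisons only five bytes while the fixed one
covers all eight.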
iosf_mbi_pci_{read,write}_mdr() use pci_{read,write}_config_dword(), which
return PCIBIOS_* codes, but the functions also return -ENODEV, and the two
kinds of error code are not compatible. As neither of the functions is
related to the PCI read/write functions, they should return normal errnos.
Convert the PCIBIOS_* return code into a normal errno using
pcibios_err_to_errno() before returning it.
Fixes: 46184415368a ("arch: x86: New MailBox support driver for Intel SOC's")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-4-ilpo.jarvinen@linux.intel.com
xen_pcifront_enable_irq() uses pci_read_config_byte() that returns
PCIBIOS_* codes. The error handling, however, assumes the codes are
normal errnos because it checks for < 0.
xen_pcifront_enable_irq() also returns the PCIBIOS_* code back to the
caller but the function is used as the (*pcibios_enable_irq) function
which should return normal errnos.
Convert the error check to plain non-zero check which works for
PCIBIOS_* return codes and convert the PCIBIOS_* return code using
pcibios_err_to_errno() into normal errno before returning it.
Fixes: 3f2a230caf21 ("xen: handled remapped IRQs when enabling a pcifront PCI device.")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240527125538.13620-3-ilpo.jarvinen@linux.intel.com
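Why the "< 0" check misses PCIBIOS_* failures can be sketched as follows. This
is a userspace model: the PCIBIOS_* values and the mapping are reproduced as
assumptions from include/linux/pci.h and should be checked against the tree in
use, and model_pcibios_err_to_errno() and the two check helpers are
hypothetical names.

```c
#include <errno.h>

/* PCIBIOS_* status codes (assumed values; all failures are positive). */
#define PCIBIOS_SUCCESSFUL		0x00
#define PCIBIOS_FUNC_NOT_SUPPORTED	0x81
#define PCIBIOS_BAD_VENDOR_ID		0x83
#define PCIBIOS_DEVICE_NOT_FOUND	0x86
#define PCIBIOS_BAD_REGISTER_NUMBER	0x87
#define PCIBIOS_SET_FAILED		0x88
#define PCIBIOS_BUFFER_TOO_SMALL	0x89

/* Model of the conversion: positive PCIBIOS code in, negative errno out. */
static int model_pcibios_err_to_errno(int err)
{
	if (err <= PCIBIOS_SUCCESSFUL)
		return err;	/* zero or already a negative errno */

	switch (err) {
	case PCIBIOS_FUNC_NOT_SUPPORTED:	return -ENOENT;
	case PCIBIOS_BAD_VENDOR_ID:		return -ENOTTY;
	case PCIBIOS_DEVICE_NOT_FOUND:		return -ENODEV;
	case PCIBIOS_BAD_REGISTER_NUMBER:	return -EFAULT;
	case PCIBIOS_SET_FAILED:		return -EIO;
	case PCIBIOS_BUFFER_TOO_SMALL:		return -ENOSPC;
	}
	return -ERANGE;
}

/* Old style check: never fires for a positive PCIBIOS_* failure. */
static int old_check_sees_error(int rc)
{
	return rc < 0;
}

/* New style check: any non-zero status counts as a failure. */
static int new_check_sees_error(int rc)
{
	return rc != 0;
}
```

A caller would test the raw status with the non-zero check and hand only the
converted value back to code that expects errnos.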
intel_mid_pci_irq_enable() uses pci_read_config_byte() that returns
PCIBIOS_* codes. The error handling, however, assumes the codes are
normal errnos because it checks for < 0.
intel_mid_pci_irq_enable() also returns the PCIBIOS_* code back to the
caller but the function is used as the (*pcibios_enable_irq) function
which should return normal errnos.
Convert the error check to plain non-zero check which works for
PCIBIOS_* return codes and convert the PCIBIOS_* return code using
pcibios_err_to_errno() into normal errno before returning it.
Fixes: 5b395e2be6c4 ("x86/platform/intel-mid: Make IRQ allocation a bit more flexible")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240527125538.13620-2-ilpo.jarvinen@linux.intel.com
x86_of_pci_irq_enable() returns PCIBIOS_* code received from
pci_read_config_byte() directly and also -EINVAL which are not
compatible error types. x86_of_pci_irq_enable() is used as
(*pcibios_enable_irq) function which should not return PCIBIOS_* codes.
Convert the PCIBIOS_* return code from pci_read_config_byte() into
normal errno using pcibios_err_to_errno().
Fixes: 96e0a0797eba ("x86: dtb: Add support for PCI devices backed by dtb nodes")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com
Merge tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- An ARM-relevant fix to not free default RMIDs of a resource control
group
- A randconfig build fix for the VMware virtual GPU driver
* tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Don't try to free nonexistent RMIDs
drm/vmwgfx: Fix missing HYPERVISOR_GUEST dependency
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix dangling references to a redistributor region if the vgic was
prematurely destroyed.
- Properly mark FFA buffers as released, ensuring that both parties
can make forward progress.
x86:
- Allow getting/setting MSRs for SEV-ES guests, if they're using the
pre-6.9 KVM_SEV_ES_INIT API.
- Always sync pending posted interrupts to the IRR prior to IOAPIC
route updates, so that EOIs are intercepted properly if the old
routing table requested that.
Generic:
- Avoid __fls(0)
- Fix reference leak on hwpoisoned page
- Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are
atomic.
- Fix bug in __kvm_handle_hva_range() where KVM calls a function
pointer that was intended to be a marker only (nothing bad happens
but kind of a mine and also technically undefined behavior)
- Do not bother accounting allocations that are small and freed
before getting back to userspace.
Selftests:
- Fix compilation for RISC-V.
- Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest.
- Compute the max mappable gfn for KVM selftests on x86 using
GuestMaxPhyAddr from KVM's supported CPUID (if it's available)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests
KVM: Discard zero mask with function kvm_dirty_ring_reset
virt: guest_memfd: fix reference leak on hwpoisoned page
kvm: do not account temporary allocations to kmem
MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support
KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes
KVM: Stop processing *all* memslots when "null" mmu_notifier handler is found
KVM: arm64: FFA: Release hyp rx buffer
KVM: selftests: Fix RISC-V compilation
KVM: arm64: Disassociate vcpus from redistributor region on teardown
KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()
KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBits
KVM: selftests: Fix shift of 32 bit unsigned int more than 32 bits
With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA
encryption"), older VMMs like QEMU 9.0 and older will fail when booting
SEV-ES guests with something like the following error:
qemu-system-x86_64: error: failed to get MSR 0x174
qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
This is because older VMMs might still call
svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot, even
though those interfaces were essentially just noops because the vCPU
state is encrypted and stored separately in the VMSA. Now those VMMs
will get an -EINVAL and generally crash.
Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of
the stricter limitations of what vCPU state can be sync'd during
guest run-time, so newer QEMU for instance will work both for legacy
KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2.
So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with
-EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing
with an older VMM, so it needs to assume that returning -EINVAL might
break the VMM.
Address this by only returning -EINVAL if the guest was started with
KVM_SEV_INIT2. Otherwise, just silently return.
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Nikunj A Dadhania <nikunj@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/
Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption")
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240604233510.764949-1-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
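The resulting policy can be summarized with a small decision function. This is
a sketch of the rule stated in the message, not the code in svm.c: struct
guest_model, model_msr_access() and MSR_ACCESS_CONTINUE are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

struct guest_model {
	bool sev_es;			/* guest is an SEV-ES guest */
	bool state_encrypted;		/* VMSA has already been encrypted */
	bool started_with_init2;	/* KVM_SEV_INIT2, not KVM_SEV_ES_INIT */
};

/* Hypothetical marker: carry on with the normal MSR handling. */
#define MSR_ACCESS_CONTINUE 1

static int model_msr_access(struct guest_model g)
{
	if (!g.sev_es || !g.state_encrypted)
		return MSR_ACCESS_CONTINUE;

	/* Newer userspace copes with -EINVAL; legacy users get a silent noop. */
	return g.started_with_init2 ? -EINVAL : 0;
}
```

Only the combination of encrypted state and KVM_SEV_INIT2 yields an error, so
VMMs that predate the stricter behavior keep booting.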
Fix a compiler warning. The bpf_jit_binary_pack_finalize function
was taking an extra bpf_prog parameter that went unused.
Remove it and update the callers accordingly.
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As 'prog->aux->tail_call_reachable' is already correct when a tail call is
present, it's unnecessary to detect tail calls in the x86 JIT.
Therefore, let's remove the detection.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240610124224.34673-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of
passing fault->addr and then converting it to a GFN.
Future changes will make fault->addr and fault->gfn differ when running
TDX guests. The GFN will be conceptually the same as it is for normal VMs,
but fault->addr may contain a TDX specific bit that differentiates between
"shared" and "private" memory. This bit will be used to direct faults to
be handled on different roots, either the normal "direct" root or a new
type of root that handles private memory. The TDP iterators will process
the traditional GFN concept and apply the required TDX specifics depending
on the root type. For this reason, it needs to operate on regular GFN and
not the addr, which may contain these special TDX specific bits.
Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then
immediately converts it to a GFN with a bit shift. However, this would
unfortunately retain the TDX specific bits in what is supposed to be a
traditional GFN. Excluding TDX's needs, it is also unnecessary to pass
fault->addr and convert it to a GFN when the GFN is already on hand.
So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and
use it directly.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
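The difference between the two conversions can be illustrated in userspace.
This is a sketch under stated assumptions: the position of the shared/private
bit (bit 51 here) is hypothetical, as are MODEL_SHARED_BIT and both function
names. Only the plain bit shift comes from the message above.

```c
#include <stdint.h>

#define PAGE_SHIFT		12
#define MODEL_SHARED_BIT	(1ULL << 51)	/* hypothetical bit position */

/* Plain shift: any special bit in the address survives into the result. */
static uint64_t addr_to_gfn_raw(uint64_t addr)
{
	return addr >> PAGE_SHIFT;
}

/* A "traditional" GFN: the special bit is stripped before shifting. */
static uint64_t addr_to_gfn_plain(uint64_t addr)
{
	return (addr & ~MODEL_SHARED_BIT) >> PAGE_SHIFT;
}
```

For an address without the special bit both functions agree, which matches
the statement that the GFN is conceptually the same as for normal VMs.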
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.
REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today these operations are when removing PTEs.
However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.
Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care to not update naming
referring the "remove" operations, which are still distinct.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace
the SPTE value and returns -EBUSY on failure. The caller must check the
return value and retry. Add __must_check to it, as well as to two more
functions that forward the return value of __tdp_mmu_set_spte_atomic to
their caller.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
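The pattern described above can be sketched with C11 atomics. This is a
userspace model, not the TDP MMU code: model_must_check is assumed to expand
to the GCC/Clang warn_unused_result attribute, and every function name here
is hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define model_must_check __attribute__((warn_unused_result))

/* Replace *sptep with new_spte only if it still holds old_spte. */
static model_must_check int model_set_spte_atomic(_Atomic uint64_t *sptep,
						  uint64_t old_spte,
						  uint64_t new_spte)
{
	if (!atomic_compare_exchange_strong(sptep, &old_spte, new_spte))
		return -EBUSY;	/* the entry changed underneath the caller */
	return 0;
}

/* Caller pattern: reread the entry and retry until the update lands. */
static uint64_t model_set_bits(_Atomic uint64_t *sptep, uint64_t bits)
{
	uint64_t old_spte;

	do {
		old_spte = atomic_load(sptep);
	} while (model_set_spte_atomic(sptep, old_spte, old_spte | bits) == -EBUSY);

	return old_spte | bits;
}

/* Demo: an update based on a stale snapshot is rejected. */
static int demo_stale_update(void)
{
	_Atomic uint64_t spte = 1;

	return model_set_spte_atomic(&spte, 0 /* stale snapshot */, 5);
}

/* Demo: the retry loop succeeds and returns the merged value. */
static uint64_t demo_retry_update(void)
{
	_Atomic uint64_t spte = 1;

	return model_set_bits(&spte, 6);
}
```

With the attribute in place, a caller that drops the return value gets a
compiler warning instead of silently losing a failed update.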
I'm getting tired of telling people to put a magic "" in the
#define X86_FEATURE /* "" ... */
comment to hide the new feature flag from the user-visible
/proc/cpuinfo.
Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.
Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.
There should be no functional changes resulting from this.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC
routes, irrespective of whether the I/O APIC is emulated by userspace or
by KVM. If a level-triggered interrupt routed through the I/O APIC is
pending or in-service for a vCPU, KVM needs to intercept EOIs on said
vCPU even if the vCPU isn't the destination for the new routing, e.g. if
servicing an interrupt using the old routing races with I/O APIC
reconfiguration.
Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and
userspace I/OAPIC reconfigure race") fixed the common cases, but
kvm_apic_pending_eoi() only checks if an interrupt is in the local
APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is
pending in the PIR.
Failure to intercept EOI can manifest as guest hangs with Windows 11 if
the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't
expose a more modern form of time to the guest.
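The ordering issue can be sketched with a simplified, non-atomic model. Everything below (struct vcpu_irq_model, sync_pir_to_irr(), pending_eoi()) is a hypothetical stand-in written for illustration; it is not KVM's code, and real PIR handling needs atomic operations.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_VECTORS 256
#define NR_WORDS   (NR_VECTORS / 32)

/* Hypothetical, simplified per-vCPU interrupt state. */
struct vcpu_irq_model {
	uint32_t pir[NR_WORDS];	/* posted interrupt requests */
	uint32_t irr[NR_WORDS];	/* interrupt request register */
	uint32_t isr[NR_WORDS];	/* in-service register */
};

static bool test_vec(const uint32_t *map, unsigned int vec)
{
	return map[vec / 32] & (1u << (vec % 32));
}

/* Fold every posted interrupt into the IRR and clear the PIR. */
static void sync_pir_to_irr(struct vcpu_irq_model *v)
{
	for (unsigned int i = 0; i < NR_WORDS; i++) {
		v->irr[i] |= v->pir[i];
		v->pir[i] = 0;
	}
}

/* Only looks at IRR and ISR, so it misses anything still in the PIR. */
static bool pending_eoi(const struct vcpu_irq_model *v, unsigned int vec)
{
	return test_vec(v->irr, vec) || test_vec(v->isr, vec);
}
```

In this model, a vector that is set only in pir[] is invisible to pending_eoi() until sync_pir_to_irr() has run, which is why the sync has to come before the scan.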
Cc: stable@vger.kernel.org
Cc: Adamos Ttofari <attofari@amazon.de>
Cc: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240611014845.82795-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501000218.work.998-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
This switches x86-64 over to using 'tzcount' instead of the integer
multiply trick to turn the bytemask information into actual byte counts.
We even had a comment saying that a fast bit count instruction is better
than a multiply, but x86 bit counting has traditionally been
"questionably fast", and so avoiding it was the right thing back in the
day.
Now, on any half-way modern core, using bit counting is cheaper and
smaller than the large constant multiply, so let's just switch over.
Note that as part of switching over to counting bits, we also do it at a
different point. We used to create the byte count from the final byte
mask, but once you use the 'tzcount' instruction (aka 'bsf' on older
CPUs), you can actually count the trailing zero bits using a value we have
available earlier.
In fact, we can just use the very first mask of bits that tells us
whether we have any zero bytes at all. The zero bytes in the word will
have the high bit set, so just doing 'tzcount' on that value and
dividing by 8 will give the number of bytes that precede the first NUL
character, which is exactly what we want.
Note also that the input value to the tzcount is by definition not zero,
since that is the condition that we already used to check the whole "do
we have any zero bytes at all". So we don't need to worry about the
legacy instruction behavior of pre-lzcount days when 'bsf' didn't have a
result for zero input.
The 32-bit code continues to use the simple bit op trick that is faster
even on newer cores, but particularly on the older 32-bit-only ones.
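A minimal user-space sketch of the 64-bit approach described above follows. The helper names are made up for this example, __builtin_ctzll() is a GCC/Clang builtin standing in for the bit count instruction, and byte 0 of the word is assumed to be the lowest-addressed byte (little-endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_BYTES 0x0101010101010101ULL
#define HIGH_BITS 0x8080808080808080ULL

/*
 * The "do we have any zero bytes at all" mask. Borrow propagation can
 * set stray bits above the first zero byte, but the lowest set bit
 * always marks the first zero byte.
 */
static uint64_t zero_byte_mask(uint64_t word)
{
	return (word - ONE_BYTES) & ~word & HIGH_BITS;
}

/* Number of bytes that precede the first NUL; mask must be non-zero. */
static size_t bytes_before_first_nul(uint64_t mask)
{
	assert(mask != 0);
	return (size_t)__builtin_ctzll(mask) >> 3;
}
```

For a word holding "abc" followed by NUL bytes, the lowest set bit of the mask is bit 31, and 31 divided by 8 gives 3.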
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated using a constant for the
hash table address followed by a shift by a constant of the hash index.
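The shape of that lookup can be sketched in plain C. The names below (struct bucket, hash_table, hash_shift, d_hash_sketch()) are hypothetical, and the sketch uses ordinary variables; the point of the runtime constant infrastructure is that the table address and shift count would instead be fixed up as constants in the generated code rather than loaded from memory.

```c
#include <stdint.h>

/* Hypothetical bucket type, for illustration only. */
struct bucket {
	void *first;
};

/* In this sketch both values are ordinary variables set once at init. */
static struct bucket *hash_table;
static unsigned int hash_shift;

/* Bucket lookup: table base plus the hash shifted down to an index. */
static struct bucket *d_hash_sketch(uint32_t hash)
{
	return hash_table + (hash >> hash_shift);
}
```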
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit
6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.
Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.
With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.
This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real. If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.
One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.
Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.
This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.
No functional change on x86.
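The effect of the gating check can be sketched as follows. This is a hypothetical user-space model: the mon_capable flag stands in for resctrl_arch_mon_capable(), and free_rmid_sketch() is not the kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracking entry, for illustration only. */
struct rmid_entry {
	bool busy;
};

static bool mon_capable;		/* stand-in for resctrl_arch_mon_capable() */
static struct rmid_entry *rmid_ptrs;	/* stays NULL when there are no monitors */

/* Returns true if an entry was actually released. */
static bool free_rmid_sketch(uint32_t idx)
{
	/* Without monitors the table was never allocated: do nothing. */
	if (!mon_capable)
		return false;

	rmid_ptrs[idx].busy = false;
	return true;
}
```

With mon_capable false, the function returns before rmid_ptrs is ever dereferenced.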
[ bp: Massage commit message. ]
Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.
Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoke KVM_RUN with immediate_exit=true and
then, after KVM reads it to set wants_to_run=false, flip immediate_exit
to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.
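The snapshot idea can be sketched in user space. The struct and function names below are assumptions for this example; only the immediate_exit and wants_to_run field names come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the memory shared with userspace. */
struct run_shared {
	volatile uint8_t immediate_exit;
};

struct vcpu_model {
	struct run_shared *run;
	bool wants_to_run;	/* private snapshot, not writable by userspace */
};

/* Read the shared field exactly once on entry to the run loop. */
static void enter_run(struct vcpu_model *vcpu)
{
	vcpu->wants_to_run = !vcpu->run->immediate_exit;
}

/* Later checks use the snapshot instead of re-reading shared memory. */
static bool should_exit_immediately(const struct vcpu_model *vcpu)
{
	return !vcpu->wants_to_run;
}
```

If the shared field is flipped after enter_run(), should_exit_immediately() still reports the value that was sampled on entry.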
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If the BSP vCPU ID was already set, ensure it doesn't get excluded when
limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID.
[mks: provide commit message, code by Sean]
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
Do not accept IDs which are definitely invalid by limit checking the
passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was
already set.
This ensures invalid values, especially on 64-bit systems, don't go
unnoticed and lead to a valid id by chance when truncated by the final
assignment.
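The truncation hazard can be sketched as follows. MAX_VCPU_IDS_SKETCH and vcpu_id_is_valid() are hypothetical names, and the limit of 4096 is an assumed value chosen for the example rather than taken from the source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit, for illustration only. */
#define MAX_VCPU_IDS_SKETCH 4096ULL

/*
 * Validate the full-width value before it is narrowed to 32 bits.
 * A max_vcpu_ids of 0 means "not set yet" in this sketch.
 */
static bool vcpu_id_is_valid(uint64_t id, uint64_t max_vcpu_ids)
{
	if (id >= MAX_VCPU_IDS_SKETCH)
		return false;
	if (max_vcpu_ids && id >= max_vcpu_ids)
		return false;
	return true;
}
```

A value such as 0x100000001 is rejected here even though narrowing it to 32 bits would yield 1, which would otherwise look like a valid ID.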
Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Ensure that EFI runtime services are not unmapped by PAN on ARM
- Avoid freeing the memory holding the EFI memory map inadvertently on
x86
- Avoid a false positive kmemleak warning on arm64
Merge tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"Another small set of EFI fixes. Only the x86 one is likely to affect
any actual users (and has a cc:stable), but the issue it fixes was
only observed in an unusual context (kexec in a confidential VM).
- Ensure that EFI runtime services are not unmapped by PAN on ARM
- Avoid freeing the memory holding the EFI memory map inadvertently
on x86
- Avoid a false positive kmemleak warning on arm64"
* tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/arm64: Fix kmemleak false positive in arm64_efi_rt_init()
efi/x86: Free EFI memory map only when installing a new one.
efi/arm: Disable LPAE PAN when calling EFI runtime services
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.
[ bp: Massage a bit. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com