46985 Commits

Author SHA1 Message Date
Sean Christopherson
32f55e475c KVM: nVMX: Request immediate exit iff pending nested event needs injection
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.

Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.

At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:04 -07:00
Sean Christopherson
d83c36d822 KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vector
Add a helper to retrieve the highest pending vector given a Posted
Interrupt descriptor.  While the actual operation is straightforward, it's
surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's
find_highest_vector(), which doesn't work with PID.PIR due to the APIC's
IRR and ISR component registers being physically discontiguous (they're
4-byte registers aligned at 16-byte intervals).

To make PIR handling more consistent with respect to IRR and ISR handling,
return -1 to indicate "no interrupt pending".

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:03 -07:00
Kai Huang
92c1e3cbf0 KVM: VMX: Switch __vmx_exit() and kvm_x86_vendor_exit() in vmx_exit()
In the vmx_init() error handling path, the __vmx_exit() is done before
kvm_x86_vendor_exit().  They should follow the same order in vmx_exit().

But currently __vmx_exit() is done after kvm_x86_vendor_exit() in
vmx_exit().  Switch the order of them to fix.

Fixes: e32b120071ea ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240627010524.3732488-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:58:23 -07:00
Sean Christopherson
23b2c5088d KVM: VMX: Remove unnecessary INVEPT[GLOBAL] from hardware enable path
Remove the completely pointess global INVEPT, i.e. EPT TLB flush, from
KVM's VMX enablement path.  KVM always does a targeted TLB flush when
using a "new" EPT root, in quotes because "new" simply means a root that
isn't currently being used by the vCPU.

KVM also _deliberately_ runs with stale TLB entries for defunct roots,
i.e. doesn't do a TLB flush when vCPUs stop using roots, precisely because
KVM does the flush on first use.  As called out by the comment in
kvm_mmu_load(), the reason KVM flushes on first use is because KVM can't
guarantee the correctness of past hypervisors.

Jumping back to the global INVEPT, when the painfully terse commit
1439442c7b25 ("KVM: VMX: Enable EPT feature for KVM") was added, the
effective TLB flush being performed was:

  static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
  {
          vpid_sync_vcpu_all(to_vmx(vcpu));
  }

I.e. KVM was not flushing EPT TLB entries when allocating a "new" root,
which very strongly suggests that the global INVEPT during hardware
enabling was a misguided hack that addressed the most obvious symptom,
but failed to fix the underlying bug.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240608001003.3296640-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:57:25 -07:00
Sean Christopherson
cb9fb5fc12 KVM: nVMX: Update VMCS12_REVISION comment to state it should never change
Rewrite the comment above VMCS12_REVISION to unequivocally state that the
ID must never change.  KVM_{G,S}ET_NESTED_STATE have been officially
supported for some time now, i.e. changing VMCS12_REVISION would break
userspace.

Opportunistically add a blurb to the CHECK_OFFSET() comment to make it
explicitly clear that new fields are allowed, i.e. that the restriction
on the layout is all about backwards compatibility.

No functional change intended.

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240613190103.1054877-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:55:00 -07:00
Kees Cook
6db1208bf9 randomize_kstack: Remove non-functional per-arch entropy filtering
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.

The prior intent of the per-architecture limits were:

  arm64: capped at 0x1FF (9 bits), 5 bits effective
  powerpc: uncapped (10 bits), 6 or 7 bits effective
  riscv: uncapped (10 bits), 6 bits effective
  x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
  s390: capped at 0xFF (8 bits), undocumented effective entropy

Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."

Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com>
Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-28 08:54:56 -07:00
Sean Christopherson
704ec48fc2 KVM: SVM: Use sev_es_host_save_area() helper when initializing tsc_aux
Use sev_es_host_save_area() instead of open coding an equivalent when
setting the MSR_TSC_AUX field during setup.

No functional change intended.

Link: https://lore.kernel.org/r/20240617210432.1642542-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:53:00 -07:00
Sean Christopherson
34830b3c02 KVM: SVM: Force sev_es_host_save_area() to be inlined (for noinstr usage)
Force sev_es_host_save_area() to be always inlined, as it's used in the
low level VM-Enter/VM-Exit path, which is non-instrumentable.

  vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xb0: call to
    sev_es_host_save_area() leaves .noinstr.text section
  vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xbf: call to
    sev_es_host_save_area.isra.0() leaves .noinstr.text section

Fixes: c92be2fd8edf ("KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area")
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240617210432.1642542-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:53:00 -07:00
Jeff Johnson
8815d77cbc KVM: x86: Add missing MODULE_DESCRIPTION() macros
Add module descriptions for the vendor modules to fix allmodconfig
'make W=1' warnings:

  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o
  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com
[sean: split kvm.ko change to separate commit]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:47:12 -07:00
Pei Li
ebbdf37ce9 KVM: Validate hva in kvm_gpc_activate_hva() to fix __kvm_gpc_refresh() WARN
Check that the virtual address is "ok" when activating a gfn_to_pfn_cache
with a host VA to ensure that KVM never attempts to use a bad address.

This fixes a bug where KVM fails to check the incoming address when
handling KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA in kvm_xen_vcpu_set_attr().

Reported-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fd555292a1da3180fc82
Tested-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240627-bug5-v2-1-2c63f7ee6739@gmail.com
[sean: rewrite changelog with --verbose]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:31:46 -07:00
Josh Poimboeuf
42c141fbb6 x86/bugs: Add 'spectre_bhi=vmexit' cmdline option
In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable.  Add that as an option.

This is similar to the old spectre_bhi=auto option which was removed
with the following commit:

  36d4fe147c87 ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")

with the main difference being that this has a more descriptive name and
is disabled by default.

Mitigation switch requested by Maksim Davydov <davydov-max@yandex-team.ru>.

  [ bp: Massage. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/2cbad706a6d5e1da2829e5e123d8d5c80330148c.1719381528.git.jpoimboe@kernel.org
2024-06-28 15:35:54 +02:00
Josh Poimboeuf
9142be9e64 x86/syscall: Mark exit[_group] syscall handlers __noreturn
The direct-call syscall dispatch function doesn't know that the exit()
and exit_group() syscall handlers don't return, so the call sites aren't
optimized accordingly.

Fix that by marking the exit syscall declarations __noreturn.

Fixes the following warnings:

  vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
  vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation

Fixes: 1e3ad78334a6 ("x86/syscall: Don't force use of indirect calls for system calls")
Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/5d8882bc077d8eadcc7fd1740b56dfb781f12288.1719381528.git.jpoimboe@kernel.org
2024-06-28 15:23:38 +02:00
Rafael J. Wysocki
f53b4bb83d Add support for amd-pstate core performance boost support which
allows controlling which CPU cores can operate above nominal
 frequencies for short periods of time.
 -----BEGIN PGP SIGNATURE-----
 
 iQJOBAABCgA4FiEECwtuSU6dXvs5GA2aLRkspiR3AnYFAmZ8fvEaHG1hcmlvLmxp
 bW9uY2llbGxvQGFtZC5jb20ACgkQLRkspiR3AnbmPQ/9FrmbFf8t0e7WQJ7wrlHz
 8HmeHGLNLQbOWLrNPP2um+33i97hxJ+h8RHnPnr9wzdDl1R+u2oR1vu5DXCYgBBA
 d9rLJv1YSFnEu9VPklAHWMyHHb+F6OsUyk6yPl8R50j2E3HOb/TjwLxIfxC0C80p
 ox2ffArMfO5iKEAcVkpKQuh0prWDoxl4eQ8UI2DoKLMu1UyZRmH/jWL8l1qNGpwF
 4nRwYl4xERF2qnMaszN+QZREirmXwzU5y1gylx25qKDpFwzotulkEyQDGVPfqBr2
 kTz0mvc+i1mrJ2P5MG5gi1Mgsxd5dA1VPhYDk+4vgE+oPnJp3kdBtOKWfkmN+mgn
 PB6gFMWJFpLm/Kl4Lu8TS3m+aE0Euctcu/pVYEhxeP5bEJ82gbxgT9/kd2hfMtMi
 6QbBTIpoJcLnUuMEaOXRYlpuAmaG3cp/gDI4UO8tid+BgoyGbOK8fkToL2s0mIFx
 JrH19ZBAEXSWcoMQVmY118H8Uy4J+1IsA4IlZweTV0ZQPQ/W8VQ2blfyvRo7dSGj
 JtGhkOYtXdtYKahqC06fyi5lfzy+huiLjQElBOFWTl5x+usntpeCuJrG2kZ/gAiS
 gxVsL1FX6J7cxN866ty3jdwNYwOt5/JwG/oq3buBCeKYobQB3qY9bK6V42siC+Qv
 bcmTcy0lrfzZoNW5fvo3JZE=
 =EZq/
 -----END PGP SIGNATURE-----

Merge tag 'amd-pstate-v6.11-2024-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux

Merge more amd-pstate driver updates for 6.11 from Mario Limonciello:

"Add support for amd-pstate core performance boost support which
 allows controlling which CPU cores can operate above nominal
 frequencies for short periods of time."

* tag 'amd-pstate-v6.11-2024-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
  Documentation: cpufreq: amd-pstate: update doc for Per CPU boost control method
  cpufreq: amd-pstate: Cap the CPPC.max_perf to nominal_perf if CPB is off
  cpufreq: amd-pstate: initialize core precision boost state
  cpufreq: acpi: move MSR_K7_HWCR_CPB_DIS_BIT into msr-index.h
2024-06-27 21:26:36 +02:00
Rafael J. Wysocki
b11ec63abe Merge back cpufreq material for v6.11. 2024-06-27 21:20:10 +02:00
Perry Yuan
4f460bff7b cpufreq: acpi: move MSR_K7_HWCR_CPB_DIS_BIT into msr-index.h
There are some other drivers also need to use the
MSR_K7_HWCR_CPB_DIS_BIT for CPB control bit, so it makes sense to move
the definition to a common header file to allow other driver to use it.

No intentional functional impact.

Suggested-by: Gautham Ranjal Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Huang Rui <ray.huang@amd.com>
Link: https://lore.kernel.org/r/78b6c75e6cffddce3e950dd543af6ae9f8eeccc3.1718988436.git.perry.yuan@amd.com
Link: https://lore.kernel.org/r/20240626042733.3747-2-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-06-26 15:48:21 -05:00
Alexey Makhalov
57b7b6acb4 x86/vmware: Add TDX hypercall support
VMware hypercalls use I/O port, VMCALL or VMMCALL instructions.  Add a call to
__tdx_hypercall() in order to support TDX guests.

No change in high bandwidth hypercalls, as only low bandwidth ones are supported
for TDX guests.

  [ bp: Massage, clear on-stack struct tdx_module_args variable. ]

Co-developed-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-9-alexey.makhalov@broadcom.com
2024-06-25 17:15:48 +02:00
Alexey Makhalov
9dfb18031f x86/vmware: Remove legacy VMWARE_HYPERCALL* macros
No more direct use of these macros should be allowed. The vmware_hypercallX API
still uses the new implementation of VMWARE_HYPERCALL macro internally, but it
is not exposed outside of the vmware.h.

Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-8-alexey.makhalov@broadcom.com
2024-06-25 17:15:48 +02:00
Alexey Makhalov
86cb65448d x86/vmware: Correct macro names
VCPU_RESERVED and LEGACY_X2APIC are not VMware hypercall commands.  These are
bits in the return value of the VMWARE_CMD_GETVCPU_INFO command.  Change
VMWARE_CMD_ prefix to GETVCPU_INFO_ one. And move the bit-shift
operation into the macro body.

Fixes: 4cca6ea04d31c ("x86/apic: Allow x2apic without IR on VMware platform")
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-7-alexey.makhalov@broadcom.com
2024-06-25 17:15:48 +02:00
Alexey Makhalov
b2c13c23ea x86/vmware: Use VMware hypercall API
Remove VMWARE_CMD macro and move to vmware_hypercall API.
No functional changes intended.

Use u32/u64 instead of uint32_t/uint64_t across the file.

Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-6-alexey.makhalov@broadcom.com
2024-06-25 17:15:47 +02:00
Alexey Makhalov
34bf25e820 x86/vmware: Introduce VMware hypercall API
Introduce a vmware_hypercall family of functions. It is a common implementation
to be used by the VMware guest code and virtual device drivers in architecture
independent manner.

The API consists of vmware_hypercallX and vmware_hypercall_hb_{out,in}
set of functions analogous to KVM's hypercall API. Architecture-specific
implementation is hidden inside.

It will simplify future enhancements in VMware hypercalls such as SEV-ES and
TDX related changes without needs to modify a caller in device drivers code.

Current implementation extends an idea from

  bac7b4e84323 ("x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls")

to have a slow, but safe path vmware_hypercall_slow() earlier during the boot
when alternatives are not yet applied.  The code inherits VMWARE_CMD logic from
the commit mentioned above.

Move common macros from vmware.c to vmware.h.

  [ bp: Fold in a fix:
    https://lore.kernel.org/r/20240625083348.2299-1-alexey.makhalov@broadcom.com ]

Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-2-alexey.makhalov@broadcom.com
2024-06-25 17:01:33 +02:00
Arnd Bergmann
d3882564a7 syscalls: fix compat_sys_io_pgetevents_time64 usage
Using sys_io_pgetevents() as the entry point for compat mode tasks
works almost correctly, but misses the sign extension for the min_nr
and nr arguments.

This was addressed on parisc by switching to
compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc:
io_pgetevents_time64() needs compat syscall in 32-bit compat mode"),
as well as by using more sophisticated system call wrappers on x86 and
s390. However, arm64, mips, powerpc, sparc and riscv still have the
same bug.

Change all of them over to use compat_sys_io_pgetevents_time64()
like parisc already does. This was clearly the intention when the
function was originally added, but it got hooked up incorrectly in
the tables.

Cc: stable@vger.kernel.org
Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-25 15:57:20 +02:00
Brian Johannesmeyer
bf6ab33d84 x86/kmsan: Fix hook for unaligned accesses
When called with a 'from' that is not 4-byte-aligned, string_memcpy_fromio()
calls the movs() macro to copy the first few bytes, so that 'from' becomes
4-byte-aligned before calling rep_movs(). This movs() macro modifies 'to', and
the subsequent line modifies 'n'.

As a result, on unaligned accesses, kmsan_unpoison_memory() uses the updated
(aligned) values of 'to' and 'n'. Hence, it does not unpoison the entire
region.

Save the original values of 'to' and 'n', and pass those to
kmsan_unpoison_memory(), so that the entire region is unpoisoned.

Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20240523215029.4160518-1-bjohannesmeyer@gmail.com
2024-06-25 11:37:21 +02:00
Ilpo Järvinen
7821fa101e x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
iosf_mbi_pci_{read,write}_mdr() use pci_{read,write}_config_dword()
that return PCIBIOS_* codes but functions also return -ENODEV which are
not compatible error codes. As neither of the functions are related to
PCI read/write functions, they should return normal errnos.

Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal
errno before returning it.

Fixes: 46184415368a ("arch: x86: New MailBox support driver for Intel SOC's")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-4-ilpo.jarvinen@linux.intel.com
2024-06-24 19:28:18 +02:00
Ilpo Järvinen
e9d7b435df x86/pci/xen: Fix PCIBIOS_* return code handling
xen_pcifront_enable_irq() uses pci_read_config_byte() that returns
PCIBIOS_* codes. The error handling, however, assumes the codes are
normal errnos because it checks for < 0.

xen_pcifront_enable_irq() also returns the PCIBIOS_* code back to the
caller but the function is used as the (*pcibios_enable_irq) function
which should return normal errnos.

Convert the error check to plain non-zero check which works for
PCIBIOS_* return codes and convert the PCIBIOS_* return code using
pcibios_err_to_errno() into normal errno before returning it.

Fixes: 3f2a230caf21 ("xen: handled remapped IRQs when enabling a pcifront PCI device.")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240527125538.13620-3-ilpo.jarvinen@linux.intel.com
2024-06-24 19:21:25 +02:00
Ilpo Järvinen
724852059e x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
intel_mid_pci_irq_enable() uses pci_read_config_byte() that returns
PCIBIOS_* codes. The error handling, however, assumes the codes are
normal errnos because it checks for < 0.

intel_mid_pci_irq_enable() also returns the PCIBIOS_* code back to the
caller but the function is used as the (*pcibios_enable_irq) function
which should return normal errnos.

Convert the error check to plain non-zero check which works for
PCIBIOS_* return codes and convert the PCIBIOS_* return code using
pcibios_err_to_errno() into normal errno before returning it.

Fixes: 5b395e2be6c4 ("x86/platform/intel-mid: Make IRQ allocation a bit more flexible")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240527125538.13620-2-ilpo.jarvinen@linux.intel.com
2024-06-24 19:19:55 +02:00
Ilpo Järvinen
ec0b4c4d45 x86/of: Return consistent error type from x86_of_pci_irq_enable()
x86_of_pci_irq_enable() returns PCIBIOS_* code received from
pci_read_config_byte() directly and also -EINVAL which are not
compatible error types. x86_of_pci_irq_enable() is used as
(*pcibios_enable_irq) function which should not return PCIBIOS_* codes.

Convert the PCIBIOS_* return code from pci_read_config_byte() into
normal errno using pcibios_err_to_errno().

Fixes: 96e0a0797eba ("x86: dtb: Add support for PCI devices backed by dtb nodes")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com
2024-06-24 19:11:31 +02:00
Linus Torvalds
b9e6612d61 - An ARM-relevant fix to not free default RMIDs of a resource control group
- A randconfig build fix for the VMware virtual GPU driver
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZ32zQACgkQEsHwGGHe
 VUqVdA//fjAlr3ZA85b+BrSV0WgGYb5BVCdZ9W2NYnoWe771VcNKM/83skoVudFu
 jKEYuguLPizQrMgiQKKBNFwhLj5X7/DeF7Wicl7BcIQ4RFt03QfwkGSsHvXxq1p8
 fcGy6WweBKaPVdCi9/UunbDOQGb8YK/gog7jR2J/tT+rPNFyO9YWRREfPE+/6Jso
 gVv71HWtUINzJkEwcW5E6RACCrcYLdlYZwpdf1OQOzprIsLXOc8yAPpks7NywrXY
 jn4Lhw31SiySZFuo7DIhlZVESaXvbvaVHw5f4joOvfzSQ+HQhjsoK+hqkqfHEFJJ
 JGEyBrXB5J1AZ4AG7Jmm+I04CIhvnl+P8R4VluxpQ6PJTVa/wXoFanHGY79VTQHd
 CD5o6STbv4xYSWWq0boI57d96gZDmRY8qbn7tZU+mb1UIoTU2YymkdM/OXeUfdzE
 ltbVqJIWxjTFd5Ar07IBFY3swjcOpr0HJ4FWWc2ybSDTK1+h8swaT82BSSJfoLio
 tG/7IeJ7ycnVufuzpOY7VbLah4kdM5irhETatkfw18cQAiKZU/AOop+mN1lgRaJ+
 RE62KPXwAv/45w1t19oLdIWCd4EqK+PfBHQcqGaIpYe0SMIVtB63r4yuv/lb0AXn
 MizuET14Qr1SpXl2Slw6V8973e58gKB45o7vaP6M0cjALuRk7Hc=
 =WXpS
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - An ARM-relevant fix to not free default RMIDs of a resource control
   group

 - A randconfig build fix for the VMware virtual GPU driver

* tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Don't try to free nonexistent RMIDs
  drm/vmwgfx: Fix missing HYPERVISOR_GUEST dependency
2024-06-23 07:16:13 -07:00
Linus Torvalds
fe37fe2a5e ARM:
* Fix dangling references to a redistributor region if the vgic was
   prematurely destroyed.
 
 * Properly mark FFA buffers as released, ensuring that both parties
   can make forward progress.
 
 x86:
 
 * Allow getting/setting MSRs for SEV-ES guests, if they're using the pre-6.9
   KVM_SEV_ES_INIT API.
 
 * Always sync pending posted interrupts to the IRR prior to IOAPIC
   route updates, so that EOIs are intercepted properly if the old routing
   table requested that.
 
 Generic:
 
 * Avoid __fls(0)
 
 * Fix reference leak on hwpoisoned page
 
 * Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic.
 
 * Fix bug in __kvm_handle_hva_range() where KVM calls a function pointer
   that was intended to be a marker only (nothing bad happens but kind of
   a mine and also technically undefined behavior)
 
 * Do not bother accounting allocations that are small and freed before
   getting back to userspace.
 
 Selftests:
 
 * Fix compilation for RISC-V.
 
 * Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest.
 
 * Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr
   from KVM's supported CPUID (if it's available).
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZ1sNwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO8Rwf/ZH+zVOkKdrA0XT71nToc9AkqObPO
 mBpV5p+E4boVHSWNQgY7R0yu1ViLc+HotTYf7MoQGeobm60YtDkWHlxcKrQD672C
 cLRdl02iRRDGMTRAhpr9jvT/yMHB5kYDxEYmO44nPJKwodcb4/4RJQpt8wyslT2G
 uUDpnYMFmSZ8/Zt7IznSEcSx1D+4WFqLT2AZPsJ55w45BFiI+5uRQ/kRaM9iM0+r
 yuOQCCK3+pV4CqA+ckbZ6j6+RufcovjEdYCoxLQDOdK6tQTD9aqwJFQ/o2tc+fJT
 Hj1MRRsqmdOePdjguBMsfDrEnjXoBveAt96BVheavbpC1UaWp5n0r8p2sA==
 =Egkk
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Fix dangling references to a redistributor region if the vgic was
     prematurely destroyed.

   - Properly mark FFA buffers as released, ensuring that both parties
     can make forward progress.

  x86:

   - Allow getting/setting MSRs for SEV-ES guests, if they're using the
     pre-6.9 KVM_SEV_ES_INIT API.

   - Always sync pending posted interrupts to the IRR prior to IOAPIC
     route updates, so that EOIs are intercepted properly if the old
     routing table requested that.

  Generic:

   - Avoid __fls(0)

   - Fix reference leak on hwpoisoned page

   - Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are
     atomic.

   - Fix bug in __kvm_handle_hva_range() where KVM calls a function
     pointer that was intended to be a marker only (nothing bad happens
     but kind of a mine and also technically undefined behavior)

   - Do not bother accounting allocations that are small and freed
     before getting back to userspace.

  Selftests:

   - Fix compilation for RISC-V.

   - Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest.

   - Compute the max mappable gfn for KVM selftests on x86 using
     GuestMaxPhyAddr from KVM's supported CPUID (if it's available)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests
  KVM: Discard zero mask with function kvm_dirty_ring_reset
  virt: guest_memfd: fix reference leak on hwpoisoned page
  kvm: do not account temporary allocations to kmem
  MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support
  KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes
  KVM: Stop processing *all* memslots when "null" mmu_notifier handler is found
  KVM: arm64: FFA: Release hyp rx buffer
  KVM: selftests: Fix RISC-V compilation
  KVM: arm64: Disassociate vcpus from redistributor region on teardown
  KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()
  KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBits
  KVM: selftests: Fix shift of 32 bit unsigned int more than 32 bits
2024-06-22 07:41:57 -07:00
Michael Roth
cf6d9d2d24 KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests
With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA
encryption"), older VMMs like QEMU 9.0 and older will fail when booting
SEV-ES guests with something like the following error:

  qemu-system-x86_64: error: failed to get MSR 0x174
  qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

This is because older VMMs that might still call
svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if
those interfaces were essentially just noops because of the vCPU state
being encrypted and stored separately in the VMSA. Now those VMMs will
get an -EINVAL and generally crash.

Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of
the stricter limitations of what vCPU state can be sync'd during
guest run-time, so newer QEMU for instance will work both for legacy
KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2.

So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with
-EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing
with either an older VMM and so it needs to assume that returning
-EINVAL might break the VMM.

Address this by only returning -EINVAL if the guest was started with
KVM_SEV_INIT2. Otherwise, just silently return.

Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Nikunj A Dadhania <nikunj@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/
Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption")
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240604233510.764949-1-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-21 07:11:29 -04:00
Rafael Passos
9919c5c98c bpf: remove unused parameter in bpf_jit_binary_pack_finalize
Fixes a compiler warning. the bpf_jit_binary_pack_finalize function
was taking an extra bpf_prog parameter that went unused.
This removves it and updates the callers accordingly.

Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-20 19:50:26 -07:00
Leon Hwang
f663a03c8e bpf, x64: Remove tail call detection
As 'prog->aux->tail_call_reachable' is correct for tail call present,
it's unnecessary to detect tail call in x86 jit.

Therefore, let's remove it.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240610124224.34673-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-20 19:48:29 -07:00
Rick Edgecombe
c2f38f75fc KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep()
Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of
passing fault->addr and then converting it to a GFN.

Future changes will make fault->addr and fault->gfn differ when running
TDX guests. The GFN will be conceptually the same as it is for normal VMs,
but fault->addr may contain a TDX specific bit that differentiates between
"shared" and "private" memory. This bit will be used to direct faults to
be handled on different roots, either the normal "direct" root or a new
type of root that handles private memory. The TDP iterators will process
the traditional GFN concept and apply the required TDX specifics depending
on the root type. For this reason, it needs to operate on regular GFN and
not the addr, which may contain these special TDX specific bits.

Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then
immediately converts it to a GFN with a bit shift. However, this would
unfortunately retain the TDX specific bits in what is supposed to be a
traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass
fault->addr and convert it to a GFN when the GFN is already on hand.

So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and
use it directly.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 18:43:31 -04:00
Rick Edgecombe
964cea8171 KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.

REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today these operations are when removing PTEs.

However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.

Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care to not update naming
referring the "remove" operations, which are still distinct.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 18:43:31 -04:00
Paolo Bonzini
02b0d3b9d4 Merge branch 'kvm-6.10-fixes' into HEAD 2024-06-20 17:31:50 -04:00
Isaku Yamahata
8a4e2742a5 KVM: x86/tdp_mmu: Sprinkle __must_check
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace
the SPTE value and returns -EBUSY on failure.  The caller must check the
return value and retry.  Add __must_check to it, as well as to two more
functions that forward the return value of __tdp_mmu_set_spte_atomic to
their caller.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 17:28:44 -04:00
Borislav Petkov (AMD)
78ce84b9e0 x86/cpufeatures: Flip the /proc/cpuinfo appearance logic
I'm getting tired of telling people to put a magic "" in the

  #define X86_FEATURE		/* "" ... */

comment to hide the new feature flag from the user-visible
/proc/cpuinfo.

Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.

Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.

There should be no functional changes resulting from this.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
2024-06-20 21:04:22 +02:00
Kees Cook
ef40d28f17 randomize_kstack: Remove non-functional per-arch entropy filtering
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.

The prior intent of the per-architecture limits were:

  arm64: capped at 0x1FF (9 bits), 5 bits effective
  powerpc: uncapped (10 bits), 6 or 7 bits effective
  riscv: uncapped (10 bits), 6 bits effective
  x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
  s390: capped at 0xFF (8 bits), undocumented effective entropy

Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."

Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com>
Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-20 11:34:46 -07:00
Sean Christopherson
f3ced000a2 KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC
routes, irrespective of whether the I/O APIC is emulated by userspace or
by KVM.  If a level-triggered interrupt routed through the I/O APIC is
pending or in-service for a vCPU, KVM needs to intercept EOIs on said
vCPU even if the vCPU isn't the destination for the new routing, e.g. if
servicing an interrupt using the old routing races with I/O APIC
reconfiguration.

Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and
userspace I/OAPIC reconfigure race") fixed the common cases, but
kvm_apic_pending_eoi() only checks if an interrupt is in the local
APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is
pending in the PIR.

Failure to intercept EOI can manifest as guest hangs with Windows 11 if
the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't
expose a more modern form of time to the guest.

Cc: stable@vger.kernel.org
Cc: Adamos Ttofari <attofari@amazon.de>
Cc: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240611014845.82795-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 14:18:02 -04:00
Masahiro Yamada
469169803d x86/kconfig: Add as-instr64 macro to properly evaluate AS_WRUSS
Some instructions are only available on the 64-bit architecture.

Bi-arch compilers that default to -m32 need the explicit -m64 option
to evaluate them properly.

Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
Closes: https://lore.kernel.org/all/20240612-as-instr-opt-wrussq-v2-1-bd950f7eead7@gmail.com/
Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240612050257.3670768-1-masahiroy@kernel.org
2024-06-20 19:48:18 +02:00
Kees Cook
d6f635bcac x86/alternatives: Make FineIBT mode Kconfig selectable
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.

Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501000218.work.998-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-19 12:41:08 -07:00
Linus Torvalds
4b8fa1173c x86-64: word-at-a-time: improve byte count calculations
This switches x86-64 over to using 'tzcount' instead of the integer
multiply trick to turn the bytemask information into actual byte counts.

We even had a comment saying that a fast bit count instruction is better
than a multiply, but x86 bit counting has traditionally been
"questionably fast", and so avoiding it was the right thing back in the
days.

Now, on any half-way modern core, using bit counting is cheaper and
smaller than the large constant multiply, so let's just switch over.

Note that as part of switching over to counting bits, we also do it at a
different point.  We used to create the byte count from the final byte
mask, but once you use the 'tzcount' instruction (aka 'bsf' on older
CPU's), you can actually count the leading zeroes using a value we have
available earlier.

In fact, we can just use the very first mask of bits that tells us
whether we have any zero bytes at all.  The zero bytes in the word will
have the high bit set, so just doing 'tzcount' on that value and
dividing by 8 will give the number of bytes that precede the first NUL
character, which is exactly what we want.

Note also that the input value to the tzcount is by definition not zero,
since that is the condition that we already used to check the whole "do
we have any zero bytes at all".  So we don't need to worry about the
legacy instruction behavior of pre-lzcount days when 'bsf' didn't have a
result for zero input.

The 32-bit code continues to use the bimple bit op trick that is faster
even on newer cores, but particularly on the older 32-bit-only ones.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-19 12:35:18 -07:00
Linus Torvalds
e3c92e8171 runtime constants: add x86 architecture support
This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated using as a constant for
hash table address followed by shift by a constant of the hash index.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-19 12:34:34 -07:00
Linus Torvalds
8a2462df15 x86/uaccess: Improve the 8-byte getuser() case
Streamline the 8-byte case and drop the special handling. Use a macro
which hides the exception handling.

No functional changes.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/CAHk-=whYb2L_atsRk9pBiFiVLGe5wNZLHhRinA69yu6FiKvDsw@mail.gmail.com
2024-06-19 19:56:07 +02:00
Dave Martin
739c976579 x86/resctrl: Don't try to free nonexistent RMIDs
Commit

  6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")

adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.

Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.

With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.

This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real.  If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.

One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.

Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.

This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.

No functional change on x86.

  [ bp: Massage commit message. ]

Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
2024-06-19 11:39:09 +02:00
Jani Nikula
d754ed2821 Merge drm/drm-next into drm-intel-next
Sync to v6.10-rc3.

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-06-19 11:38:31 +03:00
David Matlack
a6816314af KVM: Introduce vcpu->wants_to_run
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.

Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoked KVM_RUN with immediate_exit=true and
then after KVM reads it to set wants_to_run=false, flip it to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 09:20:01 -07:00
Sean Christopherson
d29bf2ca14 KVM: x86: Prevent excluding the BSP on setting max_vcpu_ids
If the BSP vCPU ID was already set, ensure it doesn't get excluded when
limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID.

[mks: provide commit message, code by Sean]

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 08:59:50 -07:00
Mathias Krause
7c305d5118 KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_ID
Do not accept IDs which are definitely invalid by limit checking the
passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was
already set.

This ensures invalid values, especially on 64-bit systems, don't go
unnoticed and lead to a valid id by chance when truncated by the final
assignment.

Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 08:59:36 -07:00
Linus Torvalds
46d1907d1c EFI fixes for v6.10 #3
- Ensure that EFI runtime services are not unmapped by PAN on ARM
 - Avoid freeing the memory holding the EFI memory map inadvertently on
   x86
 - Avoid a false positive kmemleak warning on arm64
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZm1QFgAKCRAwbglWLn0t
 XDCpAP9tB6S9uQwDsR9PuxJfWOALJEqoMWCjGzLjt5HlGePlvAD9HaltvkT5p9Ff
 TkfP4Ivl29BtuaNBIFGEiC6KJXETawc=
 =Tvsr
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:
 "Another small set of EFI fixes. Only the x86 one is likely to affect
  any actual users (and has a cc:stable), but the issue it fixes was
  only observed in an unusual context (kexec in a confidential VM).

   - Ensure that EFI runtime services are not unmapped by PAN on ARM

   - Avoid freeing the memory holding the EFI memory map inadvertently
     on x86

   - Avoid a false positive kmemleak warning on arm64"

* tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/arm64: Fix kmemleak false positive in arm64_efi_rt_init()
  efi/x86: Free EFI memory map only when installing a new one.
  efi/arm: Disable LPAE PAN when calling EFI runtime services
2024-06-18 07:48:56 -07:00
Tom Lendacky
99ef9f5984 x86/sev: Allow non-VMPL0 execution when an SVSM is present
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.

  [ bp: Massage a bit. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:58 +02:00