45512 Commits

Author SHA1 Message Date
Bjorn Helgaas
1dfc86af06 x86/pci: Reorder pci_mmcfg_arch_map() definition before calls
The typical style is to define functions before calling them.  Move
pci_mmcfg_arch_map() and pci_mmcfg_arch_unmap() earlier so they're defined
before they're called.  No functional change intended.

Link: https://lore.kernel.org/r/20231121183643.249006-10-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:57:03 -06:00
Bjorn Helgaas
f126598326 x86/pci: Return pci_mmconfig_add() failure early
If pci_mmconfig_alloc() fails, return the failure early so it's obvious
that the failure is the exception, and the success is the normal case.  No
functional change intended.

Link: https://lore.kernel.org/r/20231121183643.249006-9-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:59 -06:00
Bjorn Helgaas
f284dff47b x86/pci: Comment pci_mmconfig_insert() obscure MCFG dependency
In pci_mmconfig_insert(), there's no reference to "addr" between locking
pci_mmcfg_lock and testing "addr", so it *looks* like we should move the
test before the lock.

But 07f9b61c3915 ("x86/PCI: MMCONFIG: Check earlier for MMCONFIG region at
address zero") did that, which broke things by returning -EINVAL when
"addr" is zero instead of -EEXIST.

So 07f9b61c3915 was reverted by 67d470e0e171 ("Revert "x86/PCI: MMCONFIG:
Check earlier for MMCONFIG region at address zero"").

Add a comment about this issue to prevent it from happening again.

Link: https://lore.kernel.org/r/20231121183643.249006-8-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:56 -06:00
Bjorn Helgaas
d26e7fc3d9 x86/pci: Rename pci_mmcfg_check_reserved() to pci_mmcfg_reserved()
"pci_mmcfg_check_reserved()" doesn't give a hint about what the boolean
return value means.  Rename it to pci_mmcfg_reserved() so testing
"if (pci_mmcfg_reserved())" makes sense.

Update callers to treat the return value as boolean instead of comparing
with 0.

Link: https://lore.kernel.org/r/20231121183643.249006-7-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:53 -06:00
Bjorn Helgaas
9ad67912d0 x86/pci: Rename acpi_mcfg_check_entry() to acpi_mcfg_valid_entry()
"acpi_mcfg_check_entry()" doesn't give a hint about what the return value
means.  Rename it to "acpi_mcfg_valid_entry()", convert the return value to
bool, and update the return values and callers to match so testing
"if (acpi_mcfg_valid_entry())" makes sense.

Link: https://lore.kernel.org/r/20231121183643.249006-6-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:49 -06:00
Bjorn Helgaas
704891033b x86/pci: Rename 'MMCONFIG' to 'ECAM', use pr_fmt
The "MMCONFIG" term is not used in PCI/PCIe specs.  Replace it with "ECAM",
the term used in PCIe r6.0, sec 7.2.2.

Define pr_fmt() instead of repeating PREFIX in every log message.

Link: https://lore.kernel.org/r/20231121183643.249006-5-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:46 -06:00
Bjorn Helgaas
286ae88c9e x86/pci: Add MCFG debug logging
MCFG handling is a frequent source of problems.  Add more logging to aid in
debugging.

Enable the logging with CONFIG_DYNAMIC_DEBUG=y and the kernel boot
parameter 'dyndbg="file arch/x86/pci +p"'.

Link: https://lore.kernel.org/r/20231121183643.249006-4-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:39 -06:00
Bjorn Helgaas
e1fad9dd25 x86/pci: Reword ECAM EfiMemoryMappedIO logging to avoid 'reserved'
fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM
space") added the concept of using the EFI memory map to help decide
whether ECAM space mentioned in the MCFG table is valid.

Unfortunately it described that EfiMemoryMappedIO space as "reserved", but
it is actually not *reserved* by the EFI memory map.  EfiMemoryMappedIO
only means the firmware requested that the OS map this space for use by
firmware runtime services.

Change the dmesg logging to describe it as simply "EfiMemoryMappedIO", not
as "reserved as EfiMemoryMappedIO".  A previous commit actually *does*
reserve the space if ACPI PNP0C01/02 devices haven't done so:

  - PCI: ECAM at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO
  + PCI: ECAM at [mem 0xe0000000-0xefffffff] is EfiMemoryMappedIO; assuming valid
    PCI: ECAM [mem 0xe0000000-0xefffffff] reserved to work around lack of ACPI motherboard _CRS

Link: https://lore.kernel.org/r/20231121183643.249006-3-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05 10:56:33 -06:00
Bjorn Helgaas
070909e56a x86/pci: Reserve ECAM if BIOS didn't include it in PNP0C02 _CRS
Tomasz, Sebastian, and some Proxmox users reported problems initializing
ixgbe NICs.

I think the problem is that ECAM space described in the ACPI MCFG table is
not reserved via a PNP0C02 _CRS method as required by the PCI Firmware spec
(r3.3, sec 4.1.2), but it *is* included in the PNP0A03 host bridge _CRS as
part of the MMIO aperture.

If we allocate space for a PCI BAR, we're likely to allocate it from that
ECAM space, which obviously cannot work.

This could happen for any device, but in the ixgbe case it happens because
it's an SR-IOV device and the BIOS didn't allocate space for the VF BARs,
so Linux reallocated the bridge window leading to ixgbe and put it on top
of the ECAM space.  From Tomasz' system:

  PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
  PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources
  pci_bus 0000:00: root bus resource [mem 0x80000000-0xfbffffff window]

  pci 0000:00:01.1: PCI bridge to [bus 02-03]
  pci 0000:00:01.1:   bridge window [mem 0xfb900000-0xfbbfffff]
  pci 0000:02:00.0: [8086:10fb] type 00 class 0x020000  # ixgbe
  pci 0000:02:00.0: reg 0x10: [mem 0xfba80000-0xfbafffff 64bit]
  pci 0000:02:00.0: VF(n) BAR0 space: [mem 0x00000000-0x000fffff 64bit] (contains BAR0 for 64 VFs)
  pci 0000:02:00.0: BAR 7: no space for [mem size 0x00100000 64bit]   # VF BAR 0

  pci_bus 0000:00: No. 2 try to assign unassigned res
  pci 0000:00:01.1: resource 14 [mem 0xfb900000-0xfbbfffff] released
  pci 0000:00:01.1: BAR 14: assigned [mem 0x80000000-0x806fffff]
  pci 0000:02:00.0: BAR 0: assigned [mem 0x80000000-0x8007ffff 64bit]
  pci 0000:02:00.0: BAR 7: assigned [mem 0x80204000-0x80303fff 64bit] # VF BAR 0

Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
Fixes: fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space")
Reported-by: Tomasz Pala <gotar@polanet.pl>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218050
Reported-by: Sebastian Manciulea <manciuleas@protonmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218107
Link: https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/
Link: https://lore.kernel.org/r/20231121183643.249006-2-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org	# v6.2+
2023-12-05 10:56:07 -06:00
Vlastimil Babka
2a19be61a6 mm/slab: remove CONFIG_SLAB from all Kconfig and Makefile
Remove CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_SLAB_DEPRECATED and
everything in Kconfig files and mm/Makefile that depends on those. Since
SLUB is the only remaining allocator, remove the allocator choice, make
CONFIG_SLUB a "def_bool y" for now and remove all explicit dependencies
on SLUB or SLAB as it's now always enabled. Make every option's verbose
name and description refer to "the slab allocator" without refering to
the specific implementation. Do not rename the CONFIG_ option names yet.

Everything under #ifdef CONFIG_SLAB, and mm/slab.c is now dead code, all
code under #ifdef CONFIG_SLUB is now always compiled.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-12-05 11:14:40 +01:00
Borislav Petkov (AMD)
9c21ea53e6 x86/microcode/intel: Set new revision only after a successful update
This was meant to be done only when early microcode got updated
successfully. Move it into the if-branch.

Also, make sure the current revision is read unconditionally and only
once.

Fixes: 080990aa3344 ("x86/microcode: Rework early revisions reporting")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/ZWjVt5dNRjbcvlzR@a4bf019067fa.jf.intel.com
2023-12-03 11:49:53 +01:00
Linus Torvalds
deb4b9dd3b xen: branch for v6.7-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZWraswAKCRCAXGG7T9hj
 vlV0AP9241p7vHlIW6PIdfNZt9/dZZpuFnKHz+cE99pTZDl5nwEAoYqXUm/kEb14
 VAy7x0XIVdEt+9l4bgO2Qggx+Y184Qs=
 =APgm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - A fix for the Xen event driver setting the correct return value when
   experiencing an allocation failure

 - A fix for allocating space for a struct in the percpu area to not
   cross page boundaries (this one is for x86, a similar one for Arm was
   already in the pull request for rc3)

* tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: fix error code in xen_bind_pirq_msi_to_irq()
  x86/xen: fix percpu vcpu_info allocation
2023-12-03 08:31:53 +09:00
Borislav Petkov (AMD)
9b8493dc43 x86/CPU/AMD: Check vendor in the AMD microcode callback
Commit in Fixes added an AMD-specific microcode callback. However, it
didn't check the CPU vendor the kernel runs on explicitly.

The only reason the Zenbleed check in it didn't run on other x86 vendors
hardware was pure coincidental luck:

  if (!cpu_has_amd_erratum(c, amd_zenbleed))
	  return;

gives true on other vendors because they don't have those families and
models.

However, with the removal of the cpu_has_amd_erratum() in

  05f5f73936fa ("x86/CPU/AMD: Drop now unused CPU erratum checking function")

that coincidental condition is gone, leading to the zenbleed check
getting executed on other vendors too.

Add the explicit vendor check for the whole callback as it should've
been done in the first place.

Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix")
Cc: <stable@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231201184226.16749-1-bp@alien8.de
2023-12-02 11:40:24 +01:00
Ilpo Järvinen
0d481ff35c x86/pci: Clean up open-coded PCIBIOS return code mangling
Per PCI Firmware spec r3.3, sec 2.5.2, 2.6.2, and 2.7, the return code for
these PCI BIOS interfaces is in 8 bits of the EAX register.  Previously it
was extracted by open-coded masks and shifting.

Name the return code bits with a #define and add pcibios_get_return_code()
to extract the return code to improve code readability. In addition,
replace zero test with PCIBIOS_SUCCESSFUL.

No functional changes intended.

Link: https://lore.kernel.org/r/20231124085924.13830-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-01 16:50:59 -06:00
Ilpo Järvinen
197e0da1f1 x86/pci: Use PCI_HEADER_TYPE_* instead of literals
Replace 0x7f and 0x80 literals with PCI_HEADER_TYPE_* defines.

Link: https://lore.kernel.org/r/20231124090919.23687-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-01 15:00:43 -06:00
Ashok Raj
1f693ef550 x86/microcode/intel: Remove redundant microcode late updated message
After successful update, the late loading routine prints an update
summary similar to:

  microcode: load: updated on 128 primary CPUs with 128 siblings
  microcode: revision: 0x21000170 -> 0x21000190

Remove the redundant message in the Intel side of the driver.

  [ bp: Massage commit message. ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ZWjYhedNfhAUmt0k@a4bf019067fa.jf.intel.com
2023-12-01 18:52:01 +01:00
Like Xu
ef8d89033c KVM: x86: Remove 'return void' expression for 'void function'
The requested info will be stored in 'guest_xsave->region' referenced by
the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
to explicitly use return void expression for a void function "static void
kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].

Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 08:14:27 -08:00
Sean Christopherson
087e15206d KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed.  Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.

Userspace can invoke delete_module() the instant the last reference to KVM
is put.  If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.

Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko).  KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.

To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.

This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.

Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 08:12:17 -08:00
Paolo Bonzini
e59f75de4e KVM: x86/mmu: fix comment about mmu_unsync_pages_lock
Fix the comment about what can and cannot happen when mmu_unsync_pages_lock
is not help.  The comment correctly mentions "clearing sp->unsync", but then
it talks about unsync going from 0 to 1.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:09 -08:00
Paolo Bonzini
250ce1b4d2 KVM: x86/mmu: always take tdp_mmu_pages_lock
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:08 -08:00
Paolo Bonzini
484dd27c06 KVM: x86/mmu: remove unnecessary "bool shared" argument from iterators
The "bool shared" argument is more or less unnecessary in the
for_each_*_tdp_mmu_root_yield_safe() macros.  Many users check for
the lock before calling it; all of them either call small functions
that do the check, or end up calling tdp_mmu_set_spte_atomic() and
tdp_mmu_iter_set_spte().  Add a few assertions to make up for the
lost check in for_each_*_tdp_mmu_root_yield_safe(), but even this
is probably overkill and mostly for documentation reasons.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-3-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:08 -08:00
Paolo Bonzini
5f3c8c9187 KVM: x86/mmu: remove unnecessary "bool shared" argument from functions
Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root need to know
if the lock is taken for read or write.  Either way, protection
is achieved via RCU and tdp_mmu_pages_lock.  Remove the argument
and just assert that the lock is taken.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:07 -08:00
David Matlack
45a61ebb22 KVM: x86/mmu: Check for leaf SPTE when clearing dirty bit in the TDP MMU
Re-check that the given SPTE is still a leaf and present SPTE after a
failed cmpxchg in clear_dirty_gfn_range(). clear_dirty_gfn_range()
intends to only operate on present leaf SPTEs, but that could change
after a failed cmpxchg.

A check for present was added in commit 3354ef5a592d ("KVM: x86/mmu:
Check for present SPTE when clearing dirty bit in TDP MMU") but the
check for leaf is still buried in tdp_root_for_each_leaf_pte() and does
not get rechecked on retry.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:07 -08:00
David Matlack
1aa4bb9168 KVM: x86/mmu: Fix off-by-1 when splitting huge pages during CLEAR
Fix an off-by-1 error when passing in the range of pages to
kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end
is the last page that needs to be split (inclusive) so pass in `end + 1`
since kvm_mmu_try_split_huge_pages() expects the `end` to be
non-inclusive.

At worst this will cause a huge page to be write-protected instead of
eagerly split, which is purely a performance issue, not a correctness
issue. But even that is unlikely as it would require userspace pass in a
bitmap where the last page is the only 4K page on a huge page that needs
to be split.

Reported-by: Vipin Sharma <vipinsh@google.com>
Fixes: cb00a70bd4b7 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:51:55 -08:00
Philipp Stanner
573cc0e5cf KVM: x86: Harden copying of userspace-array against overflow
cpuid.c utilizes vmemdup_user() and array_size() to copy two userspace
arrays. This, currently, does not check for an overflow.

Use the new wrapper vmemdup_array_user() to copy the arrays more safely,
as vmemdup_user() doesn't check for overflow.

Note, KVM explicitly checks the number of entries before duplicating the
array, i.e. adding the overflow check should be a glorified nop.

Suggested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20231102181526.43279-2-pstanner@redhat.com
[sean: call out that KVM pre-checks the number of entries]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 13:16:21 -08:00
Sean Christopherson
fd89499a51 KVM: x86/pmu: Track emulated counter events instead of previous counter
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.

Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU).  That's mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.

And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter.  Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.

Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow.  Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).

Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:55 -08:00
Sean Christopherson
89acf1237b KVM: x86/pmu: Update sample period in pmc_write_counter()
Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter().  Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work.  WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:55 -08:00
Sean Christopherson
ec61b2306d KVM: x86/pmu: Remove manual clearing of fields in kvm_pmu_init()
Remove code that unnecessarily clears event_count and need_cleanup in
kvm_pmu_init(), the entire kvm_pmu is zeroed just a few lines earlier.
Vendor code doesn't set event_count or need_cleanup during .init(), and
if either VMX or SVM did set those fields it would be a flagrant bug.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Sean Christopherson
f2f63f7ec6 KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed
only for RESET, which is really just the same thing as vCPU creation,
and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't
possibly be any work to do.

Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if
guest CPUID is empty, but everything that is at all dynamic is guaranteed
to be '0'/NULL, e.g. it should be impossible for KVM to have already
created a perf event.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Sean Christopherson
1647b52757 KVM: x86/pmu: Reset the PMU, i.e. stop counters, before refreshing
Stop all counters and release all perf events before refreshing the vPMU,
i.e. before reconfiguring the vPMU to respond to changes in the vCPU
model.

Clear need_cleanup in kvm_pmu_reset() as well so that KVM doesn't
prematurely stop counters, e.g. if KVM enters the guest and enables
counters before the vCPU is scheduled out.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Sean Christopherson
cbb359d81a KVM: x86/pmu: Move PMU reset logic to common x86 code
Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are no used only
by the common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to the PMC from the
index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific
handling.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Uros Bizjak
15223c4f97 KVM: SVM,VMX: Use %rip-relative addressing to access kvm_rebooting
Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart and is also compatible with position
independent executable (-fpie) build.

No functional changes intended.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:54 -08:00
Sean Christopherson
72046d0a07 KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
When vNMI is enabled, rely entirely on hardware to correctly handle NMI
blocking, i.e. don't intercept IRET to detect when NMIs are no longer
blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

KVM intercepts IRET for two reasons:
 - To track NMI masking to be able to know at any point of time if NMI
   is masked.
 - To track NMI windows (to inject another NMI after the guest executes
   IRET, i.e. unblocks NMIs)

When vNMI is enabled, both cases are handled by hardware:
- NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by
  KVM at will.
- Hardware automatically "injects" pending virtual NMIs when virtual NMIs
  become unblocked.

However, even though pending a virtual NMI for hardware to handle is the
most common way to synthesize a guest NMI, KVM may still directly inject
an NMI via when KVM is handling two "simultaneous" NMIs (see comments in
process_nmi() for details on KVM's simultaneous NMI handling).  Per AMD's
APM, hardware sets the BLOCKING flag when software directly injects an NMI
as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:

  If Event Injection is used to inject an NMI when NMI Virtualization is
  enabled, VMRUN sets V_NMI_MASK in the guest state.

Note, it's still possible that KVM could trigger a spurious IRET VM-Exit.
When running a nested guest, KVM disables vNMI for L2 and thus will enable
IRET interception (in both vmcb01 and vmcb02) while running L2 reason.  If
a nested VM-Exit happens before L2 executes IRET, KVM can end up running
L1 with vNMI enable and IRET intercepted.  This is also a benign bug, and
even less likely to happen, i.e. can be safely punted to a future fix.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20231018192021.1893261-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:22 -08:00
Sean Christopherson
770d6aa2e4 KVM: SVM: Explicitly require FLUSHBYASID to enable SEV support
Add a sanity check that FLUSHBYASID is available if SEV is supported in
hardware, as SEV (and beyond) guests are bound to a single ASID, i.e. KVM
can't "flush" by assigning a new, fresh ASID to the guest.  If FLUSHBYASID
isn't supported for some bizarre reason, KVM would completely fail to do
TLB flushes for SEV+ guests (see pre_svm_run() and pre_sev_run()).

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231018193617.1895752-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:14 -08:00
Sean Christopherson
176bfc5b17 KVM: nSVM: Advertise support for flush-by-ASID
Advertise support for FLUSHBYASID when nested SVM is enabled, as KVM can
always emulate flushing TLB entries for a vmcb12 ASID, e.g. by running L2
with a new, fresh ASID in vmcb02.  Some modern hypervisors, e.g. VMWare
Workstation 17, require FLUSHBYASID support and will refuse to run if it's
not present.

Punt on proper support, as "Honor L1's request to flush an ASID on nested
VMRUN" is one of the TODO items in the (incomplete) list of issues that
need to be addressed in order for KVM to NOT do a full TLB flush on every
nested SVM transition (see nested_svm_transition_tlb_flush()).

Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:50:25 -08:00
Sean Christopherson
a484755ab2 Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"
Revert KVM's made-up consistency check on SVM's TLB control.  The APM says
that unsupported encodings are reserved, but the APM doesn't state that
VMRUN checks for a supported encoding.  Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
Zero), AMD behavior is typically to let software shoot itself in the foot.

This reverts commit 174a921b6975ef959dd82ee9e8844067a62e3ec1.

Fixes: 174a921b6975 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:47:43 -08:00
Sean Christopherson
c52ffadc65 KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Don't force a masterclock update when a vCPU synchronizes to the current
TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the
VM.  Unnecessarily updating the masterclock is undesirable as it can cause
kvmclock's time to jump, which is particularly painful on systems with a
stable TSC as kvmclock _should_ be fully reliable on such systems.

The unexpected time jumps are due to differences in the TSC=>nanoseconds
conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW
(the pvclock algorithm is inherently lossy).  When updating the
masterclock, KVM refreshes the "base", i.e. moves the elapsed time since
the last update from the kvmclock/pvclock algorithm to the
CLOCK_MONOTONIC_RAW algorithm.  Synchronizing kvmclock with
CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but
adds no real value when the TSC is stable.

Prior to commit 7f187922ddf6 ("KVM: x86: update masterclock values on TSC
writes"), KVM did NOT force an update when synchronizing a vCPU to the
current generation.

  commit 7f187922ddf6b67f2999a76dcb71663097b75497
  Author: Marcelo Tosatti <mtosatti@redhat.com>
  Date:   Tue Nov 4 21:30:44 2014 -0200

    KVM: x86: update masterclock values on TSC writes

    When the guest writes to the TSC, the masterclock TSC copy must be
    updated as well along with the TSC_OFFSET update, otherwise a negative
    tsc_timestamp is calculated at kvm_guest_time_update.

    Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
    "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
    becomes redundant, so remove the do_request boolean and collapse
    everything into a single condition.

Before that, KVM only re-synced the masterclock if the masterclock was
enabled or disabled  Note, at the time of the above commit, VMX
synchronized TSC on *guest* writes to MSR_IA32_TSC:

        case MSR_IA32_TSC:
                kvm_write_tsc(vcpu, msr_info);
                break;

which is why the changelog specifically says "guest writes", but the bug
that was being fixed wasn't unique to guest write, i.e. a TSC write from
the host would suffer the same problem.

So even though KVM stopped synchronizing on guest writes as of commit
0c899c25d754 ("KVM: x86: do not attempt TSC synchronization on guest
writes"), simply reverting commit 7f187922ddf6 is not an option.  Figuring
out how a negative tsc_timestamp could be computed requires a bit more
sleuthing.

In kvm_write_tsc() (at the time), except for KVM's "less than 1 second"
hack, KVM snapshotted the vCPU's current TSC *and* the current time in
nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time
in nanoseconds:

        ns = get_kernel_ns();

        ...

        if (usdiff < USEC_PER_SEC &&
            vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
                ...
        } else {
                /*
                 * We split periods of matched TSC writes into generations.
                 * For each generation, we track the original measured
                 * nanosecond time, offset, and write, so if TSCs are in
                 * sync, we can match exact offset, and if not, we can match
                 * exact software computation in compute_guest_tsc()
                 *
                 * These values are tracked in kvm->arch.cur_xxx variables.
                 */
                kvm->arch.cur_tsc_generation++;
                kvm->arch.cur_tsc_nsec = ns;
                kvm->arch.cur_tsc_write = data;
                kvm->arch.cur_tsc_offset = offset;
                matched = false;
                pr_debug("kvm: new tsc generation %llu, clock %llu\n",
                         kvm->arch.cur_tsc_generation, data);
        }

        ...

        /* Keep track of which generation this VCPU has synchronized to */
        vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
        vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
        vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

Note that the above creates a new generation and sets "matched" to false!
But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't
require the vCPU that creates the new generation to match itself, KVM
would immediately compute vcpus_matched as true for VMs with a single vCPU.
As a result, KVM would skip the masterlock update, even though a new TSC
generation was created:

        vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
                         atomic_read(&vcpu->kvm->online_vcpus));

        if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
                if (!ka->use_master_clock)
                        do_request = 1;

        if (!vcpus_matched && ka->use_master_clock)
                        do_request = 1;

        if (do_request)
                kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

On hardware without TSC scaling support, vcpu->tsc_catchup is set to true
if the guest TSC frequency is faster than the host TSC frequency, even if
the TSC is otherwise stable.  And for that mode, kvm_guest_time_update(),
by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the
kernel time at the last TSC write, to compute the guest TSC relative to
kernel time:

  static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
  {
        u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
                                      vcpu->arch.virtual_tsc_mult,
                                      vcpu->arch.virtual_tsc_shift);
        tsc += vcpu->arch.this_tsc_write;
        return tsc;
  }

Except the "kernel_ns" passed to compute_guest_tsc() isn't the current
kernel time, it's the masterclock snapshot!

        spin_lock(&ka->pvclock_gtod_sync_lock);
        use_master_clock = ka->use_master_clock;
        if (use_master_clock) {
                host_tsc = ka->master_cycle_now;
                kernel_ns = ka->master_kernel_ns;
        }
        spin_unlock(&ka->pvclock_gtod_sync_lock);

        if (vcpu->tsc_catchup) {
                u64 tsc = compute_guest_tsc(v, kernel_ns);
                if (tsc > tsc_timestamp) {
                        adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
                        tsc_timestamp = tsc;
                }
        }

And so when KVM skips the masterclock update after a TSC write, i.e. after
a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec"
is *guaranteed* to generate a negative value, because this_tsc_nsec was
captured after ka->master_kernel_ns.

Forcing a masterclock update essentially fudged around that problem, but
in a heavy handed way that introduced undesirable side effects, i.e.
unnecessarily forces a masterclock update when a new vCPU joins the party
via hotplug.

Note, KVM forces masterclock updates in other weird ways that are also
likely unnecessary, e.g. when establishing a new Xen shared info page and
when userspace creates a brand new vCPU.  But the Xen thing is firmly a
separate mess, and there are no known userspace VMMs that utilize kvmclock
*and* create new vCPUs after the VM is up and running.  I.e. the other
issues are future problems.

Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Closes: https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com
Fixes: 7f187922ddf6 ("KVM: x86: update masterclock values on TSC writes")
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20231018195638.1898375-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:47:06 -08:00
Jim Mattson
80c883db87 KVM: x86: Use a switch statement and macros in __feature_translate()
Use a switch statement with macro-generated case statements to handle
translating feature flags in order to reduce the probability of runtime
errors due to copy+paste goofs, to make compile-time errors easier to
debug, and to make the code more readable.

E.g. the compiler won't directly generate an error for duplicate if
statements

	if (x86_feature == X86_FEATURE_SGX1)
		return KVM_X86_FEATURE_SGX1;
	else if (x86_feature == X86_FEATURE_SGX2)
		return KVM_X86_FEATURE_SGX1;

and so instead reverse_cpuid_check() will fail due to the untranslated
entry pointing at a Linux-defined leaf, which provides practically no
hint as to what is broken

  arch/x86/kvm/reverse_cpuid.h:108:2: error: call to __compiletime_assert_450 declared with 'error' attribute:
                                      BUILD_BUG_ON failed: x86_leaf == CPUID_LNX_4
          BUILD_BUG_ON(x86_leaf == CPUID_LNX_4);
          ^
whereas duplicate case statements very explicitly point at the offending
code:

  arch/x86/kvm/reverse_cpuid.h:125:2: error: duplicate case value '361'
          KVM_X86_TRANSLATE_FEATURE(SGX2);
          ^
  arch/x86/kvm/reverse_cpuid.h:124:2: error: duplicate case value '360'
          KVM_X86_TRANSLATE_FEATURE(SGX1);
          ^

And without macros, the opposite type of copy+paste goof doesn't generate
any error at compile-time, e.g. this yields no complaints:

        case X86_FEATURE_SGX1:
                return KVM_X86_FEATURE_SGX1;
        case X86_FEATURE_SGX2:
                return KVM_X86_FEATURE_SGX1;

Note, __feature_translate() is forcibly inlined and the feature is known
at compile-time, so the code generation between an if-elif sequence and a
switch statement should be identical.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231024001636.890236-2-jmattson@google.com
[sean: use a macro, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:27:02 -08:00
Jim Mattson
eefe5e6682 KVM: x86: Advertise CPUID.(EAX=7,ECX=2):EDX[5:0] to userspace
The low five bits {INTEL_PSFD, IPRED_CTRL, RRSBA_CTRL, DDPD_U, BHI_CTRL}
advertise the availability of specific bits in IA32_SPEC_CTRL. Since KVM
dynamically determines the legal IA32_SPEC_CTRL bits for the underlying
hardware, the hard work has already been done. Just let userspace know
that a guest can use these IA32_SPEC_CTRL bits.

The sixth bit (MCDT_NO) states that the processor does not exhibit MXCSR
Configuration Dependent Timing (MCDT) behavior. This is an inherent
property of the physical processor that is inherited by the virtual
CPU. Pass that information on to userspace.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231024001636.890236-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 11:50:16 -08:00
Sean Christopherson
75bedc1ee9 KVM: x86: Turn off KVM_WERROR by default for all configs
Don't enable KVM_WERROR by default for x86-64 builds as KVM's one-off
-Werror enabling is *mostly* superseded by the kernel-wide WERROR, and
enabling KVM_WERROR by default can cause problems for developers working
on other subsystems.  E.g. subsystems that have a "zero W=1 regressions"
rule can inadvertently build KVM with -Werror and W=1, and end up with
build failures that are completely uninteresting to the developer (W=1 is
prone to false positives, especially on older compilers).

Keep KVM_WERROR as there are combinations where enabling WERROR isn't
feasible, e.g. the default FRAME_WARN=1024 on i386 builds generates a
non-zero number of warnings and thus errors, and there are far too many
warnings throughout the kernel to enable WERROR with W=1 (building KVM
with -Werror is desirable (with a sane compiler) as W=1 does generate
useful warnings).

Opportunistically drop the dependency on !COMPILE_TEST as it's completely
meaningless (it was copied from i195's -Werror Kconfig), as the kernel's
WERROR is explicitly *enabled* for COMPILE_TEST=y kernel's, i.e. enabling
-Werror is obviosly not dependent on COMPILE_TEST=n.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20231006205415.3501535-1-kuba@kernel.org
Link: https://lore.kernel.org/r/20231018151906.1841689-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 11:48:39 -08:00
Alexander Antonov
fdd041028f perf/x86/intel/uncore: Factor out topology_gidnid_map()
The same code is used for retrieving package ID procedure from GIDNIDMAP
register. Factor out topology_gidnid_map() to avoid code duplication.

Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-3-alexander.antonov@linux.intel.com
2023-11-30 14:29:53 +01:00
Alexander Antonov
1692cf434b perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()
Get logical socket id instead of physical id in discover_upi_topology()
to avoid out-of-bound access on 'upi = &type->topology[nid][idx];' line
that leads to NULL pointer dereference in upi_fill_topology()

Fixes: f680b6e6062e ("perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-2-alexander.antonov@linux.intel.com
2023-11-30 14:29:52 +01:00
Ashwin Dayanand Kamat
27d25348d4 x86/sev: Fix kernel crash due to late update to read-only ghcb_version
A write-access violation page fault kernel crash was observed while running
cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was
observed during hotplug, after the CPU was offlined and the process
was migrated to different CPU. setup_ghcb() is called again which
tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this
is a read_only variable which is initialised during booting.

Trying to write it results in a pagefault:

  BUG: unable to handle page fault for address: ffffffffba556e70
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  [ ...]
  Call Trace:
   <TASK>
   ? __die_body.cold+0x1a/0x1f
   ? __die+0x2a/0x35
   ? page_fault_oops+0x10c/0x270
   ? setup_ghcb+0x71/0x100
   ? __x86_return_thunk+0x5/0x6
   ? search_exception_tables+0x60/0x70
   ? __x86_return_thunk+0x5/0x6
   ? fixup_exception+0x27/0x320
   ? kernelmode_fixup_or_oops+0xa2/0x120
   ? __bad_area_nosemaphore+0x16a/0x1b0
   ? kernel_exc_vmm_communication+0x60/0xb0
   ? bad_area_nosemaphore+0x16/0x20
   ? do_kern_addr_fault+0x7a/0x90
   ? exc_page_fault+0xbd/0x160
   ? asm_exc_page_fault+0x27/0x30
   ? setup_ghcb+0x71/0x100
   ? setup_ghcb+0xe/0x100
   cpu_init_exception_handling+0x1b9/0x1f0

The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase,
and it only needs to be done once in any case.

[ mingo: Refined the changelog. ]

Fixes: 95d33bfaa3e1 ("x86/sev: Register GHCB memory when SEV-SNP is active")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com
2023-11-30 10:23:12 +01:00
Jun'ichi Nomura
78a509fba9 x86/boot: Ignore NMIs during very early boot
When there are two racing NMIs on x86, the first NMI invokes NMI handler and
the 2nd NMI is latched until IRET is executed.

If panic on NMI and panic kexec are enabled, the first NMI triggers
panic and starts booting the next kernel via kexec. Note that the 2nd
NMI is still latched. During the early boot of the next kernel, once
an IRET is executed as a result of a page fault, then the 2nd NMI is
unlatched and invokes the NMI handler.

However, NMI handler is not set up at the early stage of boot, which
results in a boot failure.

Avoid such problems by setting up a NOP handler for early NMIs.

[ mingo: Refined the changelog. ]

Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Derek Barbosa <debarbos@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2023-11-30 09:55:40 +01:00
Nathan Chancellor
5225952d74 x86/tools: Remove chkobjdump.awk
This check is superfluous now that the minimum version of binutils to
build the kernel is 2.25. This also fixes an error seen with
llvm-objdump because it does not support '-v' prior to LLVM 13:

  llvm-objdump: error: unknown argument '-v'

Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: dde24a87c5
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-3-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1362
2023-11-30 09:38:12 +01:00
Samuel Zeter
f4570ebd83 x86/tools: objdump_reformat.awk: Allow for spaces
GNU objdump and LLVM objdump have differing output formats.
Specifically, GNU objump will format its output as: address:<tab>hex,
whereas LLVM objdump displays its output as address:<space>hex.

objdump_reformat.awk incorrectly handles this discrepancy due to
the unexpected space and as a result insn_decoder_test fails, as
its input is garbled.

The instruction line being tokenized now handles a space and colon,
or tab delimiter.

Signed-off-by: Samuel Zeter <samuelzeter@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-2-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
2023-11-30 09:38:10 +01:00
Samuel Zeter
60c2ea7c89 x86/tools: objdump_reformat.awk: Ensure regex matches fwait
If there is "wait" mnemonic in the line being parsed, it is incorrectly
handled by the script, and an extra line of "fwait" in
objdump_reformat's output is inserted. As insn_decoder_test relies upon
the formatted output, the test fails.

This is reproducible when disassembling with llvm-objdump:

Pre-processed lines:

  ffffffff81033e72: 9b                    wait
  ffffffff81033e73: 48 c7 c7 89 50 42 82  movq

After objdump_reformat.awk:

  ffffffff81033e72:       9b      fwait
  ffffffff81033e72:                               wait
  ffffffff81033e73:       48 c7 c7 89 50 42 82    movq

The regex match now accepts spaces or tabs, along with the "fwait"
instruction.

Signed-off-by: Samuel Zeter <samuelzeter@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-1-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
2023-11-30 09:38:09 +01:00
Namhyung Kim
0f9e0d7928 perf/x86/amd: Reject branch stack for IBS events
The AMD IBS PMU doesn't handle branch stacks, so it should not accept
events with brstack.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231130062246.290-1-ravi.bangoria@amd.com
2023-11-30 09:34:40 +01:00
Sean Christopherson
0277022a77 KVM: x86/mmu: Declare flush_remote_tlbs{_range}() hooks iff HYPERV!=n
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n.  Wrapping yet more
code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional
branches, and makes it super obvious why the hooks *might* be valid.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29 14:44:47 -08:00
Like Xu
547c91929f KVM: x86: Get CPL directly when checking if loaded vCPU is in kernel mode
When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU.  In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.

This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample.  This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:

   Before (perf-report of a cpu-cycles sample):
      1.23%  :58945   [unknown]         [u] 0xffffffff818012e0

   After:
      1.35%  :60703   [kernel.vmlinux]  [g] asm_exc_page_fault

Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API's suggests that it's safe to use if and only if the
vCPU was preempted.  That can be cleaned up in the future, for now just
fix the glaring correctness bug.

Note #2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded.  If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.

Signed-off-by: Like Xu <likexu@tencent.com>
Fixes: e1bfc24577cc ("KVM: Move x86's perf guest info callbacks to generic KVM")
Link: https://lore.kernel.org/r/20231123075818.12521-1-likexu@tencent.com
[sean: massage changelong, add Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29 10:22:55 -08:00