linux

iv/linux

Author	SHA1	Message	Date
Yan Zhao	c9f65a3f2d	KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED For KVM_X86_QUIRK_CD_NW_CLEARED is on, remove the IPAT (ignore PAT) bit in EPT memory types when cache is disabled and non-coherent DMA are present. To correctly emulate CR0.CD=1, UC + IPAT are required as memtype in EPT. However, as with commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), WB + IPAT are now returned to workaround a BIOS issue that guest MTRRs are enabled too late. Without this workaround, a super slow guest boot-up is expected during the pre-guest-MTRR-enabled period due to UC as the effective memory type for all guest memory. Absent emulating CR0.CD=1 with UC, it makes no sense to set IPAT when KVM is honoring the guest memtype. Removing the IPAT bit in this patch allows effective memory type to honor PAT values as well, as WB is the weakest memtype. It means if a guest explicitly claims UC as the memtype in PAT, the effective memory is UC instead of previous WB. If, for some unknown reason, a guest meets a slow boot-up issue with the removal of IPAT, it's desired to fix the blamed PAT in the guest. Returning guest MTRR type as if CR0.CD=0 is also not preferred because KVMs ABI for the quirk also requires KVM to force WB memtype regardless of guest MTRRs to workaround the slow guest boot-up issue. In the future, honoring guest PAT will also allow KVM to more precisely zap SPTEs when the effective memtype changes. E.g. by not forcing WB when CR0.CD=1, instead of zapping SPTEs when guest MTRRs change, KVM can skip MTRR-induced zaps if CR0.CD=1 and zap SPTEs for non-WB MTRR ranges when CR0.CD is toggled (WB MTRR SPTEs can be kept because they're WB regardless of CR0.CD). The change of removing IPAT has been verified with normal boot-up time on old OVMF of commit c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7 as well, dated back to Apr 14 2015. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065326.20557-1-yan.y.zhao@intel.com [sean: massage changelog to apply patch without full series] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-10 17:04:42 -07:00
Yan Zhao	362ff6dca5	KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops Zap KVM TDP when noncoherent DMA assignment starts (noncoherent dma count transitions from 0 to 1) or stops (noncoherent dma count transitions from 1 to 0). Before the zap, test if guest MTRR is to be honored after the assignment starts or was honored before the assignment stops. When there's no noncoherent DMA device, EPT memory type is ((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) \| VMX_EPT_IPAT_BIT) When there're noncoherent DMA devices, EPT memory type needs to honor guest CR0.CD and MTRR settings. So, if noncoherent DMA count transitions between 0 and 1, EPT leaf entries need to be zapped to clear stale memory type. This issue might be hidden when the device is statically assigned with VFIO adding/removing MMIO regions of the noncoherent DMA devices for several times during guest boot, and current KVM MMU will call kvm_mmu_zap_all_fast() on the memslot removal. But if the device is hot-plugged, or if the guest has mmio_always_on for the device, the MMIO regions of it may only be added for once, then there's no path to do the EPT entries zapping to clear stale memory type. Therefore do the EPT zapping when noncoherent assignment starts/stops to ensure stale entries cleaned away. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065223.20432-1-yan.y.zhao@intel.com [sean: fix misspelled words in comment and changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-10 17:04:38 -07:00
Joel Granados	ecd5d66383	x86/vdso: Remove now superfluous sentinel element from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from abi_table2. This removal is safe because register_sysctl implicitly uses ARRAY_SIZE() in addition to checking for the sentinel. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-10-10 15:22:02 -07:00
Joel Granados	83e291d3f5	arch/x86: Remove now superfluous sentinel elem from ctl_table arrays This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from sld_sysctl and itmt_kern_table. This removal is safe because register_sysctl_init and register_sysctl implicitly use the array size in addition to checking for the sentinel. Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86 Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-10-10 15:22:02 -07:00
Linus Torvalds	b711538a40	hyperv-fixes for v6.6-rc6 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmUk4fcTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXhhqCACWsBYTB0EJ3oMJnzfnHeuN418ZDx/O AL0k0O5MT6roEFmvGUhzJ/jsoxL+W+Wj3aFwzReyOSQpgjTTF/Ja26LPvxRzDxKi sZPojnR2ykW31l7y+eh1p9qSM/aYvTMDP5zO7L1fBnWMAGMv8w8RezpCJ7bh4BgA FTMZZrvKYVT9hCGkYqKUZGBtDTPZ56WE+MCiRxTWQvF+4QKaIff0tpno8V7203bE D/b4+Ouh19RXFTC5dUq/0JtAdV2AadrPHnScUupc8Hk/MMFiU5CzvH4bAqiwXBcU YqqlD3kZbIqqbKE93+03jvyrRDvDGlq+rpA3KMk5MBAfrkM4DytpWvMs =SVq1 -----END PGP SIGNATURE----- Merge tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - fixes for Hyper-V VTL code (Saurabh Sengar and Olaf Hering) - fix hv_kvp_daemon to support keyfile based connection profile (Shradha Gupta) * tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hv/hv_kvp_daemon:Support for keyfile based connection profile hyperv: reduce size of ms_hyperv_info x86/hyperv: Add common print prefix "Hyper-V" in hv_init x86/hyperv: Remove hv_vtl_early_init initcall x86/hyperv: Restrict get_vtl to only VTL platforms	2023-10-10 11:01:21 -07:00
Thomas Gleixner	48525fd1ea	x86/cpu: Provide debug interface Provide debug files which dump the topology related information of cpuinfo_x86. This is useful to validate the upcoming conversion of the topology evaluation for correctness or bug compatibility. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.353191313@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	90781f0c4c	x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids Per CPU cpuinfo is used to persist the logical package and die IDs. That's really not the right place simply because cpuinfo is subject to be reinitialized when a CPU goes through an offline/online cycle. This works by chance today, but that's far from correct and neither obvious nor documented. Add a per cpu datastructure which persists those logical IDs, which allows to cleanup the CPUID evaluation code. This is a temporary workaround until the larger topology management is in place, which makes all of this logical management mechanics obsolete. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.292947071@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	db4a4086a2	x86/apic: Use u32 for wakeup_secondary_cpu[_64]() APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	59f7928cd4	x86/apic: Use u32 for [gs]et_apic_id() APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.172569282@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	01ccf9bbd2	x86/apic: Use u32 for phys_pkg_id() APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width even if that callback going to be removed soonish. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.113097126@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	8aa2a4178d	x86/apic: Use u32 for cpu_present_to_apicid() APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width and fixup a few related usage sites for consistency sake. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085113.054064391@linutronix.de	2023-10-10 14:38:19 +02:00
Thomas Gleixner	5d376b8fb1	x86/apic: Use u32 for check_apicid_used() APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width and move the default implementation to local.h as there are no users outside the apic directory. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.981956102@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	4705243d23	x86/apic: Use u32 for APIC IDs in global data APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width and fixup the most obvious usage sites of that. The APIC callbacks will be addressed separately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	9ff4275bc8	x86/apic: Use BAD_APICID consistently APIC ID checks compare with BAD_APICID all over the place, but some initializers and some code which fiddles with global data structure use -1[U] instead. That simply cannot work at all. Fix it up and use BAD_APICID consistently all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.862835121@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	6e29032340	x86/cpu: Move cpu_l[l2]c_id into topology info The topology IDs which identify the LLC and L2 domains clearly belong to the per CPU topology information. Move them into cpuinfo_x86::cpuinfo_topo and get rid of the extra per CPU data and the related exports. This also paves the way to do proper topology evaluation during early boot because it removes the only per CPU dependency for that. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.803864641@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	22dc963162	x86/cpu: Move logical package and die IDs into topology info Yet another topology related data pair. Rename logical_proc_id to logical_pkg_id so it fits the common naming conventions. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.745139505@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	594957d723	x86/cpu: Remove pointless evaluation of x86_coreid_bits cpuinfo_x86::x86_coreid_bits is only used by the AMD numa topology code. No point in evaluating it on non AMD systems. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.687588373@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	e3c0c5d52a	x86/cpu: Move cu_id into topology info No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.628405546@linutronix.de	2023-10-10 14:38:18 +02:00
Thomas Gleixner	e95256335d	x86/cpu: Move cpu_core_id into topology info Rename it to core_id and stick it to the other ID fields. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.566519388@linutronix.de	2023-10-10 14:38:17 +02:00
Thomas Gleixner	8a169ed40f	x86/cpu: Move cpu_die_id into topology info Move the next member. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.388185134@linutronix.de	2023-10-10 14:38:17 +02:00
Thomas Gleixner	02fb601d27	x86/cpu: Move phys_proc_id into topology info Rename it to pkg_id which is the terminology used in the kernel. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.329006989@linutronix.de	2023-10-10 14:38:17 +02:00
Thomas Gleixner	b9655e702d	x86/cpu: Encapsulate topology information in cpuinfo_x86 The topology related information is randomly scattered across cpuinfo_x86. Create a new structure cpuinfo_topo and move in a first step initial_apicid and apicid into it. Aside of being better readable this is in preparation for replacing the horribly fragile CPU topology evaluation code further down the road. Consolidate APIC ID fields to u32 as that represents the hardware type. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de	2023-10-10 14:38:17 +02:00
Thomas Gleixner	965e05ff8a	x86/apic: Fake primary thread mask for XEN/PV The SMT control mechanism got added as speculation attack vector mitigation. The implemented logic relies on the primary thread mask to be set up properly. This turns out to be an issue with XEN/PV guests because their CPU hotplug mechanics do not enumerate APICs and therefore the mask is never correctly populated. This went unnoticed so far because by chance XEN/PV ends up with smp_num_siblings == 2. So cpu_smt_control stays at its default value CPU_SMT_ENABLED and the primary thread mask is never evaluated in the context of CPU hotplug. This stopped "working" with the upcoming overhaul of the topology evaluation which legitimately provides a fake topology for XEN/PV. That sets smp_num_siblings to 1, which causes the core CPU hot-plug core to refuse to bring up the APs. This happens because cpu_smt_control is set to CPU_SMT_NOT_SUPPORTED which causes cpu_bootable() to evaluate the unpopulated primary thread mask with the conclusion that all non-boot CPUs are not valid to be plugged. The core code has already been made more robust against this kind of fail, but the primary thread mask really wants to be populated to avoid other issues all over the place. Just fake the mask by pretending that all XEN/PV vCPUs are primary threads, which is consistent because all of XEN/PVs topology is fake or non-existent. Fixes: 6a4d2657e048 ("x86/smp: Provide topology_is_primary_thread()") Fixes: f54d4434c281 ("x86/apic: Provide cpu_primary_thread mask") Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.210011520@linutronix.de	2023-10-10 14:38:17 +02:00
Pu Wen	ee545b94d3	x86/cpu/hygon: Fix the CPU topology evaluation for real Hygon processors with a model ID > 3 have CPUID leaf 0xB correctly populated and don't need the fixed package ID shift workaround. The fixup is also incorrect when running in a guest. Fixes: e0ceeae708ce ("x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processors") Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/tencent_594804A808BD93A4EBF50A994F228E3A7F07@qq.com Link: https://lore.kernel.org/r/20230814085112.089607918@linutronix.de	2023-10-10 14:38:16 +02:00
Like Xu	bf328e22e4	KVM: x86: Don't sync user-written TSC against startup values The legacy API for setting the TSC is fundamentally broken, and only allows userspace to set a TSC "now", without any way to account for time lost between the calculation of the value, and the kernel eventually handling the ioctl. To work around this, KVM has a hack which, if a TSC is set with a value which is within a second's worth of the last TSC "written" to any vCPU in the VM, assumes that userspace actually intended the two TSC values to be in sync and adjusts the newly-written TSC value accordingly. Thus, when a VMM restores a guest after suspend or migration using the legacy API, the TSCs aren't necessarily right, but at least they're in sync. This trick falls down when restoring a guest which genuinely has been running for less time than the 1 second of imprecision KVM allows for in in the legacy API. On creation, the first vCPU starts its TSC counting from zero, and the subsequent vCPUs synchronize to that. But then when the VMM tries to restore a vCPU's intended TSC, because the VM has been alive for less than 1 second and KVM's default TSC value for new vCPU's is '0', the intended TSC is within a second of the last "written" TSC and KVM incorrectly adjusts the intended TSC in an attempt to synchronize. But further hacks can be piled onto KVM's existing hackish ABI, and declare that the first value written by userspace (on any vCPU) should not be subject to this "correction", i.e. KVM can assume that the first write from userspace is not an attempt to sync up with TSC values that only come from the kernel's default vCPU creation. To that end: Add a flag, kvm->arch.user_set_tsc, protected by kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in the VM has been set by userspace, and make the 1-second slop hack only trigger if user_set_tsc is already set. Note that userspace can explicitly request a synchronization of the TSC by writing zero. For the purpose of user_set_tsc, an explicit synchronization counts as "setting" the TSC, i.e. if userspace then subsequently writes an explicit non-zero value which happens to be within 1 second of the previous value, the new value will be "corrected". This behavior is deliberate, as treating explicit synchronization as "setting" the TSC preserves KVM's existing behaviour inasmuch as possible (KVM always applied the 1-second "correction" regardless of whether the write came from userspace vs. the kernel). Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 Suggested-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Tested-by: Yong He <alexyonghe@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 17:29:52 -07:00
Yan Zhao	9a3768191d	KVM: x86/mmu: Zap SPTEs on MTRR update iff guest MTRRs are honored When guest MTRRs are updated, zap SPTEs and do zap range calcluation if and only if KVM's MMU is honoring guest MTRRs, which is the only time that KVM incorporates the guest's MTRR type into the final memtype. Suggested-by: Chao Gao <chao.gao@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065156.20375-1-yan.y.zhao@intel.com [sean: rephrase shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 14:35:14 -07:00
Yan Zhao	7a18c7c2b6	KVM: x86/mmu: Zap SPTEs when CR0.CD is toggled iff guest MTRRs are honored Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring guest MTRRs, which is the only time that KVM incorporates the guest's CR0.CD into the final memtype. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065122.20315-1-yan.y.zhao@intel.com [sean: rephrase shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 14:35:13 -07:00
Yan Zhao	1affe455d6	KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs Add helpers to check if KVM honors guest MTRRs instead of open coding the logic in kvm_tdp_page_fault(). Future fixes and cleanups will also need to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and and non-coherent DMA transitions. Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can check if guest MTRRs were honored when stopping non-coherent DMA. Note, there is no need to explicitly check that TDP is enabled, KVM clears shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only if EPT is enabled. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com [sean: squash into a one patch, drop explicit TDP check massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 14:34:57 -07:00
Jim Mattson	8b0e00fba9	KVM: x86: Virtualize HWCR.TscFreqSel[bit 24] On certain CPUs, Linux guests expect HWCR.TscFreqSel[bit 24] to be set. If it isn't set, they complain: [Firmware Bug]: TSC doesn't count with P0 frequency! Allow userspace (and the guest) to set this bit in the virtual HWCR to eliminate the above complaint. Allow the guest to write the bit even though its is R/O on some CPUs. Like many bits in HWRC, TscFreqSel is not architectural at all. On Family 10h[1], it was R/W and powered on as 0. In Family 15h, one of the "changes relative to Family 10H Revision D processors[2] was: • MSRC001_0015 [Hardware Configuration (HWCR)]: • Dropped TscFreqSel; TSC can no longer be selected to run at NB P0-state. Despite the "Dropped" above, that same document later describes HWCR[bit 24] as follows: TscFreqSel: TSC frequency select. Read-only. Reset: 1. 1=The TSC increments at the P0 frequency If the guest clears the bit, the worst case scenario is the guest will be no worse off than it is today, e.g. the whining may return after a guest clears the bit and kexec()'s into a new kernel. [1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf [2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/42301_15h_Mod_00h-0Fh_BKDG.pdf, Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230929230246.1954854-3-jmattson@google.com [sean: elaborate on why the bit is writable by the guest] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 12:36:16 -07:00
Jim Mattson	598a790fc2	KVM: x86: Allow HWCR.McStatusWrEn to be cleared once set When HWCR is set to 0, store 0 in vcpu->arch.msr_hwcr. Fixes: 191c8137a939 ("x86/kvm: Implement HWCR support") Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230929230246.1954854-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 12:36:15 -07:00
Uros Bizjak	636d6a8b85	locking/atomic/x86: Introduce arch_sync_try_cmpxchg() Introduce the arch_sync_try_cmpxchg() macro to improve code using sync_try_cmpxchg() locking primitive. The new definitions use existing __raw_try_cmpxchg() macros, but use its own "lock; " prefix. The new macros improve assembly of the cmpxchg loop in evtchn_fifo_unmask() from drivers/xen/events/events_fifo.c from: 57a: 85 c0 test %eax,%eax 57c: 78 52 js 5d0 <...> 57e: 89 c1 mov %eax,%ecx 580: 25 ff ff ff af and $0xafffffff,%eax 585: c7 04 24 00 00 00 00 movl $0x0,(%rsp) 58c: 81 e1 ff ff ff ef and $0xefffffff,%ecx 592: 89 4c 24 04 mov %ecx,0x4(%rsp) 596: 89 44 24 08 mov %eax,0x8(%rsp) 59a: 8b 74 24 08 mov 0x8(%rsp),%esi 59e: 8b 44 24 04 mov 0x4(%rsp),%eax 5a2: f0 0f b1 32 lock cmpxchg %esi,(%rdx) 5a6: 89 04 24 mov %eax,(%rsp) 5a9: 8b 04 24 mov (%rsp),%eax 5ac: 39 c1 cmp %eax,%ecx 5ae: 74 07 je 5b7 <...> 5b0: a9 00 00 00 40 test $0x40000000,%eax 5b5: 75 c3 jne 57a <...> <...> to: 578: a9 00 00 00 40 test $0x40000000,%eax 57d: 74 2b je 5aa <...> 57f: 85 c0 test %eax,%eax 581: 78 40 js 5c3 <...> 583: 89 c1 mov %eax,%ecx 585: 25 ff ff ff af and $0xafffffff,%eax 58a: 81 e1 ff ff ff ef and $0xefffffff,%ecx 590: 89 4c 24 04 mov %ecx,0x4(%rsp) 594: 89 44 24 08 mov %eax,0x8(%rsp) 598: 8b 4c 24 08 mov 0x8(%rsp),%ecx 59c: 8b 44 24 04 mov 0x4(%rsp),%eax 5a0: f0 0f b1 0a lock cmpxchg %ecx,(%rdx) 5a4: 89 44 24 04 mov %eax,0x4(%rsp) 5a8: 75 30 jne 5da <...> <...> 5da: 8b 44 24 04 mov 0x4(%rsp),%eax 5de: eb 98 jmp 578 <...> The new code removes move instructions from 585: 5a6: and 5a9: and the compare from 5ac:. Additionally, the compiler assumes that cmpxchg success is more probable and optimizes code flow accordingly. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org	2023-10-09 18:14:25 +02:00
Ingo Molnar	fdb8b7a1af	Linux 6.6-rc5 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUjFeceHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNCAH/RDI8G44DCV9Ps5U rl/FMf6iLUxU6fCS3Wwe8vtppLjPP7Y16AH5HKMumoDIqTfh9ZAUVKhZfT+PTgz3 /oFXcGzZQLTcdbtH7XK2/zk7N/RI25/rDiCDd1uIJVCNii+hsBKS6Ihc4wXadxaR 0z3lwoEKp2egeaeqmJWMzJLdjRrYhLs33+SEciVYqTiIvlWsM5QBm/sMvES7V57s TXrs5/y7yXtDBZ2PgYNCBRLyBazjqB28x07aQoePOAs6nFXl5N/wWPW/4wirWFHT s9LYZlmVo+O+RHWj10ASm/2l+ihgn959ZfRj1VekK2AWU1x/VzSPcuCXKvsrUoa+ xEjL+vM= =efE3 -----END PGP SIGNATURE----- Merge tag 'v6.6-rc5' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2023-10-09 18:09:23 +02:00
Sandipan Das	25e5684782	perf/x86/amd/uncore: Add memory controller support Unified Memory Controller (UMC) events were introduced with Zen 4 as a part of the Performance Monitoring Version 2 (PerfMonV2) enhancements. An event is specified using the EventSelect bits and the RdWrMask bits can be used for additional filtering of read and write requests. As of now, a maximum of 12 channels of DDR5 are available on each socket and each channel is controlled by a dedicated UMC. Each UMC, in turn, has its own set of performance monitoring counters. Since the MSR address space for the UMC PERF_CTL and PERF_CTR registers are reused across sockets, uncore groups are created on the basis of socket IDs. Hence, group exclusivity is mandatory while opening events so that events for an UMC can only be opened on CPUs which are on the same socket as the corresponding memory channel. For each socket, the total number of available UMC counters and active memory channels are determined from CPUID leaf 0x80000022 EBX and ECX respectively. Usually, on Zen 4, each UMC has four counters. MSR assignments are determined on the basis of active UMCs. E.g. if UMCs 1, 4 and 9 are active for a given socket, then * UMC 1 gets MSRs 0xc0010800 to 0xc0010807 as PERF_CTLs and PERF_CTRs * UMC 4 gets MSRs 0xc0010808 to 0xc001080f as PERF_CTLs and PERF_CTRs * UMC 9 gets MSRs 0xc0010810 to 0xc0010817 as PERF_CTLs and PERF_CTRs If there are sockets without any online CPUs when the amd_uncore driver is loaded, UMCs for such sockets will not be discoverable since the mechanism relies on executing the CPUID instruction on an online CPU from the socket. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/b25f391205c22733493abec1ed850b71784edc5f.1696425185.git.sandipan.das@amd.com	2023-10-09 16:12:25 +02:00
Sandipan Das	83a43c6221	perf/x86/amd/uncore: Add group exclusivity In some cases, it may be necessary to restrict opening PMU events to a subset of CPUs. E.g. Unified Memory Controller (UMC) PMUs are specific to each active memory channel and the MSR address space for the PERF_CTL and PERF_CTR registers is reused on each socket. Thus, opening events for a specific UMC PMU should be restricted to CPUs belonging to the same socket as that of the UMC. The "cpumask" of the PMU should also reflect this accordingly. Uncore PMUs which require this can use the new group attribute in struct amd_uncore_pmu to set a valid group ID during the scan() phase. Later, during init(), an uncore context for a CPU will be unavailable if the group ID does not match. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/937d6d71010a48ea4e069f4904b3116a5f99ecdf.1696425185.git.sandipan.das@amd.com	2023-10-09 16:12:24 +02:00
Sandipan Das	7ef0343855	perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable Not all uncore PMUs may support the use of the RDPMC instruction for reading counters. In such cases, read the count from the corresponding PERF_CTR register using the RDMSR instruction. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/e9d994e32a3fcb39fa59fcf43ab4260d11aba097.1696425185.git.sandipan.das@amd.com	2023-10-09 16:12:24 +02:00
Sandipan Das	07888daa05	perf/x86/amd/uncore: Move discovery and registration Uncore PMUs have traditionally been registered in the module init path. This is fine for the existing DF and L3 PMUs since the CPUID information does not vary across CPUs but not for the memory controller (UMC) PMUs since information like active memory channels can vary for each socket depending on how the DIMMs have been physically populated. To overcome this, the discovery of PMU information using CPUID is moved to the startup of UNCORE_STARTING. This cannot be done in the startup of UNCORE_PREP since the hotplug callback does not run on the CPU that is being brought online. Previously, the startup of UNCORE_PREP was used for allocating uncore contexts following which, the startup of UNCORE_STARTING was used to find and reuse an existing sibling context, if possible. Any unused contexts were added to a list for reclaimation later during the startup of UNCORE_ONLINE. Since all required CPUID info is now available only after the startup of UNCORE_STARTING has completed, context allocation has been moved to the startup of UNCORE_ONLINE. Before allocating contexts, the first CPU that comes online has to take up the additional responsibility of registering the PMUs. This is a one-time process though. Since sibling discovery now happens prior to deciding whether a new context is required, there is no longer a need to track and free up unused contexts. The teardown of UNCORE_ONLINE and UNCORE_PREP functionally remain the same. Overall, the flow of control described above is achieved using the following handlers for managing uncore PMUs. It is mandatory to define them for each type of uncore PMU. * scan() runs during startup of UNCORE_STARTING and collects PMU info using CPUID. * init() runs during startup of UNCORE_ONLINE, registers PMUs and sets up uncore contexts. * move() runs during teardown of UNCORE_ONLINE and migrates uncore contexts to a shared sibling, if possible. * free() runs during teardown of UNCORE_PREP and frees up uncore contexts. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/e6c447e48872fcab8452e0dd81b1c9cb09f39eb4.1696425185.git.sandipan.das@amd.com	2023-10-09 16:12:23 +02:00
Sandipan Das	d6389d3ccc	perf/x86/amd/uncore: Refactor uncore management Since struct amd_uncore is used to manage per-cpu contexts, rename it to amd_uncore_ctx in order to better reflect its purpose. Add a new struct amd_uncore_pmu to encapsulate all attributes which are shared by per-cpu contexts for a corresponding PMU. These include the number of counters, active mask, MSR and RDPMC base addresses, etc. Since the struct pmu is now embedded, the corresponding amd_uncore_pmu for a given event can be found by simply using container_of(). Finally, move all PMU-specific code to separate functions. While the original event management functions continue to provide the base functionality, all PMU-specific quirks and customizations are applied in separate functions. The motivation is to simplify the management of uncore PMUs. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/24b38c49a5dae65d8c96e5d75a2b96ae97aaa651.1696425185.git.sandipan.das@amd.com	2023-10-09 16:12:23 +02:00
Tero Kristo	05276d4831	perf/x86/cstate: Allow reading the package statistics from local CPU The MSR registers for reading the package residency counters are available on every CPU of the package. To avoid doing unnecessary SMP calls to read the values for these from the various CPUs inside a package, allow reading them from any CPU of the package. Suggested-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230912124432.3616761-2-tero.kristo@linux.intel.com	2023-10-09 16:12:22 +02:00
Joerg Roedel	b9cb9c4558	x86/sev: Check IOBM for IOIO exceptions from user-space Check the IO permission bitmap (if present) before emulating IOIO #VC exceptions for user-space. These permissions are checked by hardware already before the #VC is raised, but due to the VC-handler decoding race it needs to be checked again in software. Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Dohrmann <erbse.13@gmx.de> Cc: <stable@kernel.org>	2023-10-09 15:47:57 +02:00
Borislav Petkov (AMD)	a37cd2a59d	x86/sev: Disable MMIO emulation from user mode A virt scenario can be constructed where MMIO memory can be user memory. When that happens, a race condition opens between when the hardware raises the #VC and when the #VC handler gets to emulate the instruction. If the MOVS is replaced with a MOVS accessing kernel memory in that small race window, then write to kernel memory happens as the access checks are not done at emulation time. Disable MMIO emulation in user mode temporarily until a sensible use case appears and justifies properly handling the race window. Fixes: 0118b604c2c9 ("x86/sev-es: Handle MMIO String Instructions") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Dohrmann <erbse.13@gmx.de> Cc: <stable@kernel.org>	2023-10-09 15:45:34 +02:00
Lucy Mielke	38cd5b6a87	perf/x86/intel/pt: Fix kernel-doc comments Some parameters or return codes were either wrong or missing, update them. Signed-off-by: Lucy Mielke <lucymielke@icloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/ZSOjQW3e2nJR4bAo@fedora.fritz.box	2023-10-09 12:26:22 +02:00
JP Kobryn	e53899771a	perf/x86/lbr: Filter vsyscall addresses We found that a panic can occur when a vsyscall is made while LBR sampling is active. If the vsyscall is interrupted (NMI) for perf sampling, this call sequence can occur (most recent at top): __insn_get_emulate_prefix() insn_get_emulate_prefix() insn_get_prefixes() insn_get_opcode() decode_branch_type() get_branch_type() intel_pmu_lbr_filter() intel_pmu_handle_irq() perf_event_nmi_handler() Within __insn_get_emulate_prefix() at frame 0, a macro is called: peek_nbyte_next(insn_byte_t, insn, i) Within this macro, this dereference occurs: (insn)->next_byte Inspecting registers at this point, the value of the next_byte field is the address of the vsyscall made, for example the location of the vsyscall version of gettimeofday() at 0xffffffffff600000. The access to an address in the vsyscall region will trigger an oops due to an unhandled page fault. To fix the bug, filtering for vsyscalls can be done when determining the branch type. This patch will return a "none" branch if a kernel address if found to lie in the vsyscall region. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org	2023-10-08 12:25:18 +02:00
Kees Cook	a56d5551e1	perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS=y (for array indexing) and CONFIG_FORTIFY_SOURCE=y (for strcpy/memcpy-family functions). Found with Coccinelle: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Add __counted_by for 'struct rapl_pmus'. No change in functionality intended. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20231006201754.work.473-kees@kernel.org	2023-10-08 12:18:17 +02:00
Randy Dunlap	025d5ac978	x86/resctrl: Fix kernel-doc warnings The kernel test robot reported kernel-doc warnings here: monitor.c:34: warning: Cannot understand * @rmid_free_lru A least recently used list of free RMIDs on line 34 - I thought it was a doc line monitor.c:41: warning: Cannot understand * @rmid_limbo_count count of currently unused but (potentially) on line 41 - I thought it was a doc line monitor.c:50: warning: Cannot understand * @rmid_entry - The entry in the limbo and free lists. on line 50 - I thought it was a doc line We don't have a syntax for documenting individual data items via kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen for consistency. Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization") Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org	2023-10-08 11:45:16 +02:00
Waiman Long	2743fe89d4	x86/idle: Disable IBRS when CPU is offline to improve single-threaded performance Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle") disables IBRS when the CPU enters long idle. However, when a CPU becomes offline, the IBRS bit is still set when X86_FEATURE_KERNEL_IBRS is enabled. That will impact the performance of a sibling CPU. Mitigate this performance impact by clearing all the mitigation bits in SPEC_CTRL MSR when offline. When the CPU is online again, it will be re-initialized and so restoring the SPEC_CTRL value isn't needed. Add a comment to say that native_play_dead() is a __noreturn function, but it can't be marked as such to avoid confusion about the missing MSR restoration code. When DPDK is running on an isolated CPU thread processing network packets in user space while its sibling thread is idle. The performance of the busy DPDK thread with IBRS on and off in the sibling idle thread are: IBRS on IBRS off ------- -------- packets/second: 7.8M 10.4M avg tsc cycles/packet: 282.26 209.86 This is a 25% performance degradation. The test system is a Intel Xeon 4114 CPU @ 2.20GHz. [ mingo: Extended the changelog with performance data from the 0/4 mail. ] Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-3-longman@redhat.com	2023-10-07 11:33:28 +02:00
Waiman Long	e3e3bab184	x86/speculation: Add __update_spec_ctrl() helper Add a new __update_spec_ctrl() helper which is a variant of update_spec_ctrl() that can be used in a noinstr function. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-2-longman@redhat.com	2023-10-07 11:33:28 +02:00
Baolin Wang	55d2a0bd5e	mm: add statistics for PUD level pagetable Recently, we found that cross-die access to pagetable pages on ARM64 machines can cause performance fluctuations in our business. Currently, there are no PMU events available to track this situation on our ARM64 machines, so accurate pagetable accounting can help to analyze this issue, but now the PUD level pagetable accounting is missed. So introduce pagetable_pud_ctor/dtor() to help to get accurate PUD pagetable accounting, as well as converting the architectures which use generic PUD pagetable allocation to add corresponding PUD pagetable accounting. Moreover this patch will mark the PUD level pagetable with PG_table flag, which will help to do sanity validation in unpoison_memory(). On my testing machine, I can see more pagetables statistics after the patch with page-types tool: Before patch: flags page-count MB symbolic-flags long-symbolic-flags 0x0000000004000000 27326 106 __________________________g_________________ pgtable After patch: 0x0000000004000000 27541 107 __________________________g_________________ pgtable Link: https://lkml.kernel.org/r/876c71c03a7e69c17722a690e3225a4f7b172fb2.1695017383.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-06 14:44:10 -07:00
Sohil Mehta	2fd0ebad27	arch: Reserve map_shadow_stack() syscall number for all architectures commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall") recently added support for map_shadow_stack() but it is limited to x86 only for now. There is a possibility that other architectures (namely, arm64 and RISC-V), that are implementing equivalent support for shadow stacks, might need to add support for it. Independent of that, reserving arch-specific syscall numbers in the syscall tables of all architectures is good practice and would help avoid future conflicts. map_shadow_stack() is marked as a conditional syscall in sys_ni.c. Adding it to the syscall tables of other architectures is harmless and would return ENOSYS when exercised. Note, map_shadow_stack() was assigned #453 during the merge process since #452 was taken by fchmodat2(). For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc syscall tables. For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from the common syscall numbering system in the other architectures. Link: https://lore.kernel.org/lkml/20230515212255.GA562920@debug.ba.rivosinc.com/ Link: https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b78a@sirena.org.uk/ Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2023-10-06 22:26:51 +02:00
Kirill A. Shutemov	9ee4318c15	x86/tdx: Mark TSC reliable In x86 virtualization environments, including TDX, RDTSC instruction is handled without causing a VM exit, resulting in minimal overhead and jitters. On the other hand, other clock sources (such as HPET, ACPI timer, APIC, etc.) necessitate VM exits to implement, resulting in more fluctuating measurements compared to TSC. Thus, those clock sources are not effective for calibrating TSC. As a foundation, the host TSC is guaranteed to be invariant on any system which enumerates TDX support. TDX guests and the TDX module build on that foundation by enforcing: - Virtual TSC is monotonously incrementing for any single VCPU; - Virtual TSC values are consistent among all the TD’s VCPUs at the level supported by the CPU: + VMM is required to set the same TSC_ADJUST; + VMM must not modify from initial value of TSC_ADJUST before SEAMCALL; - The frequency is determined by TD configuration: + Virtual TSC frequency is specified by VMM on TDH.MNG.INIT; + Virtual TSC starts counting from 0 at TDH.MNG.INIT; The result is that a reliable TSC is a TDX architectural guarantee. Use the TSC as the only reliable clock source in TD guests, bypassing unstable calibration. This is similar to what the kernel already does in some VMWare and HyperV environments. [ dhansen: changelog tweaks ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Erdem Aktas <erdemaktas@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20231006144549.2633-1-kirill.shutemov%40linux.intel.com	2023-10-06 10:00:04 -07:00
Mario Limonciello	7d08f21f8c	x86/PCI: Avoid PME from D3hot/D3cold for AMD Rembrandt and Phoenix USB4 Iain reports that USB devices can't be used to wake a Lenovo Z13 from suspend. This occurs because on some AMD platforms, even though the Root Ports advertise PME_Support for D3hot and D3cold, wakeup events from devices on a USB4 controller don't result in wakeup interrupts from the Root Port when amd-pmc has put the platform in a hardware sleep state. If amd-pmc will be involved in the suspend, remove D3hot and D3cold from the PME_Support mask of Root Ports above USB4 controllers so we avoid those states if we need wakeups. Restore D3 support at resume so that it can be used by runtime suspend. This affects both AMD Rembrandt and Phoenix SoCs. "pm_suspend_target_state == PM_SUSPEND_ON" means we're doing runtime suspend, and amd-pmc will not be involved. In that case PMEs work as advertised in D3hot/D3cold, so we don't need to do anything. Note that amd-pmc is technically optional, and there's no need for this quirk if it's not present, but we assume it's always present because power consumption is so high without it. Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend") Link: https://lore.kernel.org/r/20231004144959.158840-1-mario.limonciello@amd.com Reported-by: Iain Lane <iain@orangesquash.org.uk> Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from-suspend-with-external-USB-keyboard/m-p/5217121 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> [bhelgaas: commit log, move to arch/x86/pci/fixup.c, add #includes] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org	2023-10-06 09:09:47 -05:00

... 3 4 5 6 7 ...

45181 Commits