Commit Graph

Ard Biesheuvel
eb54c2ae4a x86/boot/64: Use RIP_REL_REF() to access early page tables
The early statically allocated page tables are populated from code that
executes from a 1:1 mapping so it cannot use plain accesses from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.
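
For context, the mechanism looks roughly like this (a sketch approximating
the macro from this series; details may differ):

  /* force a RIP-relative LEA so the reference is correct both from
   * the 1:1 mapping and from the kernel virtual mapping */
  static __always_inline __pure void *rip_rel_ptr(void *p)
  {
          asm("leaq %c1(%%rip), %0" : "=r"(p) : "i"(p));
          return p;
  }

  #define RIP_REL_REF(var)  (*(typeof(&(var)))rip_rel_ptr(&(var)))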

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-23-ardb+git@google.com
2024-02-26 12:58:35 +01:00
Ard Biesheuvel
4f8b6cf25f x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
'__supported_pte_mask' is accessed from code that executes from a 1:1
mapping so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-22-ardb+git@google.com
2024-02-26 12:58:35 +01:00
Ard Biesheuvel
b0fe5fb609 x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
early_dynamic_pgts[] and next_early_pgt are accessed from code that
executes from a 1:1 mapping so it cannot use a plain access from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-21-ardb+git@google.com
2024-02-26 12:58:35 +01:00
Ard Biesheuvel
d9ec115805 x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
'phys_base' is assigned from code that executes from a 1:1 mapping so it
cannot use a plain access from C. Replace the use of fixup_pointer()
with RIP_REL_REF(), which is better and simpler.

While at it, move the assignment to before the addition of the SME mask
so there is no need to subtract it again, and drop the unnecessary
addition ('phys_base' is statically initialized to 0x0).
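
The resulting order of operations, sketched (simplified; the surrounding
code in __startup_64() differs):

  /* assign phys_base via a RIP-relative reference before the SME
   * mask is folded into load_delta, so nothing needs subtracting */
  RIP_REL_REF(phys_base) = load_delta;
  load_delta += sme_get_me_mask();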

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-20-ardb+git@google.com
2024-02-26 12:58:35 +01:00
Ard Biesheuvel
5da7936719 x86/boot/64: Simplify global variable accesses in GDT/IDT programming
There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.
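
Sketch of the merged path (not the exact hunk; the descriptor becomes an
on-stack local):

  void __head startup_64_setup_gdt_idt(void)
  {
          /* an on-stack descriptor can be referenced directly,
           * no RIP_REL_REF() needed for the descriptor itself */
          struct desc_ptr gdt_descr = {
                  .size    = sizeof(startup_gdt) - 1,
                  .address = (unsigned long)&RIP_REL_REF(startup_gdt),
          };

          native_load_gdt(&gdt_descr);
  }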

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-19-ardb+git@google.com
2024-02-26 12:58:11 +01:00
Ingo Molnar
2e5fc4786b Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree
We are going to queue up a number of patches that depend
on fresh changes in x86/sev - merge in that branch to
reduce the number of conflicts going forward.

Also resolve a current conflict with x86/sev.

Conflicts:
	arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-26 11:10:35 +01:00
Ingo Molnar
29cd85557d Linux 6.8-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmXb0T4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5YQH/3eCV90sNGch0Y94
 8rtTdqFrVx7QPNl0pz+Mo6OUIKUUHvTuwime16ckLxG+3x2Y3I0MjP1edd1NB99C
 Kje//JTpaZBPpTZ/jY4u8B1Shov2Drdx/J4NFnE/9rG6yXzKQBtvON/xAxXDCVHT
 mLhst2LR0FeCSMk9jAX6CoqUPEgwlylNyAetKxaDQgoHl4GTZC7FDO17WxyjpIxe
 1rVHsrV9Eq8kD4uxrzpTYWgZrwTObPmlZjvefa1JfzSwRNABIBJj/C1nra1Zc1oi
 b7xVaXS1cMOxrtuuG00fmHsPnWivu0tuND7H3/yLd1mRCZAPSsVbVvrI/KNtoeV4
 1euINlY=
 =7IFt
 -----END PGP SIGNATURE-----

Merge tag 'v6.8-rc6' into x86/boot, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-26 11:06:14 +01:00
Linus Torvalds
1eee4ef38c - Make sure clearing CPU buffers using VERW happens at the latest
 possible point in the return-to-userspace path, otherwise memory accesses
 after the VERW execution could cause data to land in CPU buffers again
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXbG7IACgkQEsHwGGHe
 VUoEEg//d1qt/PEWCC23wMO6gLMl4J/e4ZQAuGOKGed/jUmOaQKpHJmpDMRc0li5
 llRYDdfE0ikmtQT3t9vQDs3xbWfT5bLMsijliRimb193FaS1HGlHMMS1nxhfjyfv
 MecbWfkwzX2JnrxJpsbfue+7kks3HyIXYsXV7kSFiHavk4F3GFQXYLO11pKbNQwN
 9UfjJDeVsrcWPGCHhoPKF5NHUnQKIA8ZC6g8yBq894AtdWOhFY7ePKBZefUWQQ1n
 myc5GJ3dKFICMCZvkMABtHYCmHU/W3y/6tPtnrz3kT8GdCIAHG+K9VRUfY1ml94H
 x327GoM3sEzHLsPizKy00/Uao+j6FOtv631LoDLsO2MF3sHoTZDaSgg5y2D/ZC7t
 IZdK3mUGtdINRhGiWWpdxyaMfkQ62cdZk8FkeYkRAewYS6WYSdMX3cPqFNy4Ss5u
 r3reMOD3JcxAatcqhHMXjARMfY+N08gQBpxBul3ejgH8t8aY7xJx6Vggty5kBlHZ
 7urV9jIRxSXfbBmOcYu6HP1ucFLWNSUQCBn7Imrh+5zbE1XVv7NaAWvT4Nmgb0/X
 57fHoYYSVwaJ0k3zWWM7QYEdcuJ7IZnVgTCQYx26Ec2AOQRxE9ose+awTLYtTbp1
 T+XaOlItHKMRzx9K46D7xHwmC5qiokFki3exp5vfGZxGyT3+t/c=
 =n5us
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure clearing CPU buffers using VERW happens at the latest
   possible point in the return-to-userspace path, otherwise memory
   accesses after the VERW execution could cause data to land in CPU
   buffers again

* tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/VMX: Move VERW closer to VMentry for MDS mitigation
  KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
  x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
  x86/entry_32: Add VERW just before userspace transition
  x86/entry_64: Add VERW just before userspace transition
  x86/bugs: Add asm helpers for executing VERW
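
The mitigation boils down to executing VERW, with a memory operand, as
late as possible before the transition to userspace. A minimal C rendering
of the idea (the actual kernel code is an asm macro patched in via
alternatives; this is an illustration only):

  static inline void clear_cpu_buffers(void)
  {
          u16 ds = __KERNEL_DS;

          /* VERW with a memory operand flushes the affected CPU
           * buffers; only ZF is clobbered */
          asm volatile("verw %0" : : "m" (ds) : "cc");
  }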
2024-02-25 10:22:21 -08:00
Thomas Gleixner
c147e1ef59 x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
The recent restriction to invoke irqdomain_ops::select() only when the
domain bus token is not DOMAIN_BUS_ANY breaks the search for the parent MSI
domain of HPET and IO-APIC. The latter causes a full boot failure.

The restriction itself makes sense to avoid adding DOMAIN_BUS_ANY matches
into the various ARM specific select() callbacks. Reverting this change
would obviously break ARM platforms again and require DOMAIN_BUS_ANY
matches added to various places.

A simpler solution is to use the DOMAIN_BUS_GENERIC_MSI token for the HPET
and IO-APIC parent domain search. This works out of the box because the
affected parent domains check only for the firmware specification content
and not for the bus token.
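
Sketch of the search (simplified; fwspec setup abbreviated and the device
id name is an assumption):

  struct irq_fwspec fwspec = {
          .fwnode      = fn,
          .param_count = 1,
          .param[0]    = hpet_id,         /* name assumed */
  };

  /* DOMAIN_BUS_GENERIC_MSI instead of DOMAIN_BUS_ANY: the parent
   * domains match on the fwspec contents, not on the bus token */
  parent = irq_find_matching_fwspec(&fwspec, DOMAIN_BUS_GENERIC_MSI);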

Fixes: 5aa3c0cf5b ("genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/878r38cy8n.ffs@tglx
2024-02-25 18:53:08 +01:00
Linus Torvalds
ac389bc0ca cxl fixes for 6.8-rc6
- Fix NUMA initialization from ACPI CEDT.CFMWS
 
 - Fix region assembly failures due to async init order
 
 - Fix / simplify export of qos_class information
 
 - Fix cxl_acpi initialization vs single-window-init failures
 
 - Fix handling of repeated 'pci_channel_io_frozen' notifications
 
 - Workaround platforms that violate host-physical-address ==
   system-physical address assumptions
 
 - Defer CXL CPER notification handling to v6.9
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZdpH9gAKCRDfioYZHlFs
 ZwZlAQDE+PxTJnjCXDVnDylVF4yeJF2G/wSkH1CFVFVxa0OjhAD/ZFScS/nz/76l
 1IYYiiLqmVO5DdmJtfKtq16m7e1cZwc=
 =PuPF
 -----END PGP SIGNATURE-----

Merge tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dan Williams:
 "A collection of significant fixes for the CXL subsystem.

  The largest change in this set, one that bordered on "new development", is
  the fix for the fact that the location of the new qos_class attribute
  did not match the Documentation. The fix ends up deleting more code
  than it added, and it has a new unit test to backstop basic errors in
  this interface going forward. So the "red-diff" and unit test saved
  the "rip it out and try again" response.

  In contrast, the new notification path for firmware reported CXL
  errors (CXL CPER notifications) has a locking context bug that cannot
  be fixed with a red-diff. Given where the release cycle stands, it is
  not comfortable to squeeze in that fix in these waning days. So, that
  receives the "back it out and try again later" treatment.

  There is a regression fix in the code that establishes memory NUMA
  nodes for platform CXL regions. That has an ack from x86 folks. There
  are a couple more fixups for Linux to understand (reassemble) CXL
  regions instantiated by platform firmware. The policy around platforms
  that do not match host-physical-address with system-physical-address
  (i.e. systems that have an address translation mechanism between the
  address range reported in the ACPI CEDT.CFMWS and endpoint decoders)
  has been softened to abort driver load rather than teardown the memory
  range (can cause system hangs). Lastly, there is a robustness /
  regression fix for cases where the driver would previously continue in
  the face of error, and a fixup for PCI error notification handling.

  Summary:

   - Fix NUMA initialization from ACPI CEDT.CFMWS

   - Fix region assembly failures due to async init order

   - Fix / simplify export of qos_class information

   - Fix cxl_acpi initialization vs single-window-init failures

   - Fix handling of repeated 'pci_channel_io_frozen' notifications

   - Workaround platforms that violate host-physical-address ==
     system-physical address assumptions

   - Defer CXL CPER notification handling to v6.9"

* tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/acpi: Fix load failures due to single window creation failure
  acpi/ghes: Remove CXL CPER notifications
  cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window
  cxl/test: Add support for qos_class checking
  cxl: Fix sysfs export of qos_class for memdev
  cxl: Remove unnecessary type cast in cxl_qos_class_verify()
  cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf'
  cxl/region: Allow out of order assembly of autodiscovered regions
  cxl/region: Handle endpoint decoders in cxl_region_find_decoder()
  x86/numa: Fix the sort compare func used in numa_fill_memblks()
  x86/numa: Fix the address overlap check in numa_fill_memblks()
  cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-24 15:53:40 -08:00
Baoquan He
a4eeb2176d x86, crash: wrap crash dumping code into crash related ifdefs
Now that the crash code under the kernel/ folder has been split out from
the kexec code, crash dumping can be separated from kexec reboot in
config items on x86 with some adjustments.

Here, also change some ifdefs or IS_ENABLED() checks to more appropriate
ones, e.g.:
 - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP
 - (!IS_ENABLED(CONFIG_KEXEC_CORE)) -> (!IS_ENABLED(CONFIG_CRASH_RESERVE))

[bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope]
  Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u
Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:23 -08:00
Baoquan He
443cbaf9e2 crash: split vmcoreinfo exporting code out from crash_core.c
Now move the relevant code into separate files:
kernel/vmcore_info.c, include/linux/vmcore_info.h.

And add config item VMCORE_INFO to control its enabling.

And also update the old ifdeffery of CONFIG_CRASH_CORE, the inclusions of
<linux/crash_core.h>, and the config item dependencies on CRASH_CORE
accordingly.

And also do renaming as follows:
 - arch/xxx/kernel/{crash_core.c => vmcore_info.c}
because they are only related to vmcoreinfo exporting on x86, arm64 and
riscv.

And also remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
decide whether crash_core.c is built in.

[yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c]
  Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:22 -08:00
Baoquan He
85fcde402d kexec: split crashkernel reservation code out from crash_core.c
Patch series "Split crash out from kexec and clean up related config
items", v3.

Motivation:
=============
Previously, LKP reported a build error. When investigating, it turned out
that it couldn't be resolved reasonably with the present messy kdump
config items.

 https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/

The kdump (crash dumping) related config items can cause confusion:

Firstly,

CRASH_CORE enables codes including
 - crashkernel reservation;
 - elfcorehdr updating;
 - vmcoreinfo exporting;
 - crash hotplug handling;

Now fadump of powerpc, kcore dynamic debugging and kdump all select
CRASH_CORE, while:
 - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing
   the global variable 'elfcorehdr_addr';
 - kcore only needs vmcoreinfo exporting;
 - kdump needs all of the current kernel/crash_core.c.

So enabling only PROC_KCORE or FA_DUMP will enable CRASH_CORE, which
misleads people into thinking we enable crash dumping when actually we
do not.

Secondly,

It's not reasonable to allow KEXEC_CORE to select CRASH_CORE.

KEXEC_CORE enables code which allocates control pages, copies kexec/kdump
segments, and prepares for switching. This code is shared by both kexec
reboot and kdump. We may want kexec reboot but disable kdump; in that
case, CRASH_CORE should not be selected.

 --------------------
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
 CONFIG_KEXEC=y
 CONFIG_KEXEC_FILE=y
 ---------------------

Thirdly,

It's not reasonable to allow CRASH_DUMP to select KEXEC_CORE.

That could leave KEXEC_CORE and CRASH_DUMP enabled independently of
KEXEC or KEXEC_FILE. However, without KEXEC or KEXEC_FILE, the built-in
KEXEC_CORE code doesn't make any sense because no kernel loading or
switching will happen to utilize it.
 ---------------------
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
 CONFIG_CRASH_DUMP=y
 ---------------------

What is worse, in this case, on the sh and arm architectures KEXEC relies
on MMU while CRASH_DUMP can still be enabled with !MMU; a compile error
is then seen, as the LKP test robot reported in the above link.

 ------arch/sh/Kconfig------
 config ARCH_SUPPORTS_KEXEC
         def_bool MMU

 config ARCH_SUPPORTS_CRASH_DUMP
         def_bool BROKEN_ON_SMP
 ---------------------------

Changes:
===========
1, split out crash_reserve.c from crash_core.c;
2, split out vmcore_info.c from crash_core.c;
3, move crash related code in kexec_core.c into crash_core.c;
4, remove the dependency of FA_DUMP on CRASH_DUMP;
5, clean up kdump related config items;
6, wrap up crash code in crash related ifdefs on all 8 arches
   which support crash dumping, except for ppc;

Achievement:
===========
With above changes, I can rearrange the config item logic as below (the right
item depends on or is selected by the left item):

    PROC_KCORE -----------> VMCORE_INFO

               |----------> VMCORE_INFO
    FA_DUMP----|
               |----------> CRASH_RESERVE

                                                    ---->VMCORE_INFO
                                                   /
                                                   |---->CRASH_RESERVE
    KEXEC      --|                                /|
                 |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
    KEXEC_FILE --|                               \ |
                                                   \---->CRASH_HOTPLUG


    KEXEC      --|
                 |--> KEXEC_CORE (for kexec reboot only)
    KEXEC_FILE --|

Test
========
On all 8 architectures, namely x86_64, arm64, s390x, sh, arm, mips,
riscv and loongarch, I did the three cases of config item setting below,
and all builds passed. Take the configs on x86_64 as an example here:

(1) Both CONFIG_KEXEC and CONFIG_KEXEC_FILE are unset, then all
kexec/kdump items are unset automatically:
# Kexec and crash features
# CONFIG_KEXEC is not set
# CONFIG_KEXEC_FILE is not set
# end of Kexec and crash features

(2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
---------------
# Kexec and crash features
CONFIG_CRASH_RESERVE=y
CONFIG_VMCORE_INFO=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
---------------

(3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
------------------------
# Kexec and crash features
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
# end of Kexec and crash features
------------------------

Note:
For ppc, it needs investigation to work out how to split out the crash
code in the arch folder. Hope Hari and Pingfan can help have a look and
see if it's doable. For now, I make it either have both kexec and crash
enabled, or disable both of them altogether.


This patch (of 14):

Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
relevant code into separate files: kernel/crash_reserve.c,
include/linux/crash_reserve.h.

And also add config item CRASH_RESERVE to control enabling of this code.
And update the config items which are related to crashkernel reservation.

And also change the ifdeffery from CONFIG_CRASH_CORE to
CONFIG_CRASH_RESERVE where those scopes are only related to crashkernel
reservation.

And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
arm64, x86 and risc-v because those architectures' crash_core.h is only
related to crashkernel reservation.

[akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:21 -08:00
Oliver Upton
284851ee5c KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
The general expectation with debugfs is that any initialization failure
is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
implementations to return an error and kvm_create_vm_debugfs() allows
that to fail VM creation.

Change to a void return to discourage architectures from making debugfs
failures fatal for the VM. Seems like everyone already had the right
idea, as all implementations already return 0 unconditionally.
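
The hook's new shape (sketch):

  /* debugfs failures must not be fatal, so the hook cannot fail */
  void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
  {
  }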

Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-23 21:44:58 +00:00
Sean Christopherson
d02c357e5b KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation.  Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault.  And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.

Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock.  This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.
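
For reference, the pre-check is roughly the following (a simplified
sketch of the new helper; treat field and helper details as approximate):

  static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
                                                     unsigned long mmu_seq,
                                                     gfn_t gfn)
  {
          /* READ_ONCE() ensures a pre-check loop re-reads the flag
           * instead of spinning on a cached value */
          if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
              gfn >= kvm->mmu_invalidate_range_start &&
              gfn < kvm->mmu_invalidate_range_end)
                  return true;

          return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
  }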

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23 10:14:34 -08:00
Kirill A. Shutemov
c2cfc23f79 x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
The trampoline_start64() vector is used when a secondary CPU starts in
64-bit mode. The current implementation directly enters compatibility
mode, which is necessary in order to disable paging and re-enable it in
the correct paging mode: either 4- or 5-level, depending on the
configuration.

The X86S[1] ISA does not support compatibility mode in ring 0, and
paging cannot be disabled.

Rework the trampoline_start64() function to only enter compatibility
mode if it is necessary to change the paging mode. If the CPU is
already in the desired paging mode, proceed in long mode.

This allows a secondary CPU to boot on an X86S machine as long as the
CPU is already in the correct paging mode.

In the future, there will be a mechanism to switch between paging modes
without disabling paging.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
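
In pseudo-C with hypothetical helper names (the real code is assembly in
trampoline_64.S):

  /* hypothetical rendering of the reworked flow */
  if (la57_wanted() == la57_active()) {
          /* already in the desired paging mode: stay in long mode */
          jump_to_long_mode_startup();
  } else {
          /* paging mode must change: drop to compatibility mode,
           * disable paging, then re-enable 4- or 5-level paging */
          enter_compat_mode_and_switch_paging();
  }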

[ dhansen: changelog tweaks ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20240126100101.689090-1-kirill.shutemov%40linux.intel.com
2024-02-23 08:40:29 -08:00
Masahiro Yamada
bf48d9b756 kbuild: change tool coverage variables to take the path relative to $(obj)
Commit 54b8ae66ae ("kbuild: change *FLAGS_<basetarget>.o to take the
path relative to $(obj)") changed the syntax of per-file compiler flags.

The situation is the same for the following variables:

  OBJECT_FILES_NON_STANDARD_<basetarget>.o
  GCOV_PROFILE_<basetarget>.o
  KASAN_SANITIZE_<basetarget>.o
  KMSAN_SANITIZE_<basetarget>.o
  KMSAN_ENABLE_CHECKS_<basetarget>.o
  UBSAN_SANITIZE_<basetarget>.o
  KCOV_INSTRUMENT_<basetarget>.o
  KCSAN_SANITIZE_<basetarget>.o
  KCSAN_INSTRUMENT_BARRIERS_<basetarget>.o

The <basetarget> is the filename of the target with its directory and
suffix stripped.

This syntax comes into a trouble when two files with the same basename
appear in one Makefile, for example:

  obj-y += dir1/foo.o
  obj-y += dir2/foo.o
  OBJECT_FILES_NON_STANDARD_foo.o := y

OBJECT_FILES_NON_STANDARD_foo.o is applied to both dir1/foo.o and
dir2/foo.o. This syntax is not flexible enough to handle cases where
one of them is a standard object, but the other is not.

It is more sensible to use the relative path to the Makefile, like this:

  obj-y += dir1/foo.o
  OBJECT_FILES_NON_STANDARD_dir1/foo.o := y
  obj-y += dir2/foo.o
  OBJECT_FILES_NON_STANDARD_dir2/foo.o := y

To maintain the current behavior, I made adjustments to the following two
Makefiles:

 - arch/x86/entry/vdso/Makefile, which compiles vclock_gettime.o, vgetcpu.o,
   and their vdso32 variants.

 - arch/x86/kvm/Makefile, which compiles vmx/vmenter.o and svm/vmenter.o

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Acked-by: Sean Christopherson <seanjc@google.com>
2024-02-23 21:06:21 +09:00
Paolo Bonzini
0cbca1bf44 x86: irq: unconditionally define KVM interrupt vectors
Unlike arch/x86/kernel/idt.c, FRED support chose to remove the #ifdefs
from the .c files and concentrate them in the headers, where unused
handlers are #define'd to NULL.

However, the constants for KVM's 3 posted interrupt vectors are still
defined conditionally in irq_vectors.h.  In the tree that FRED support was
developed on, this is innocuous because CONFIG_HAVE_KVM was effectively
always set.  With the cleanups that recently went into the KVM tree to
remove CONFIG_HAVE_KVM, the conditional became IS_ENABLED(CONFIG_KVM).
This causes a linux-next compilation failure in FRED code when
CONFIG_KVM=n.

In preparation for the merging of FRED in Linux 6.9, define the interrupt
vector numbers unconditionally.
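
i.e. the following definitions move out of the #ifdef (values as defined
in arch/x86/include/asm/irq_vectors.h; shown here for illustration):

  /* defined unconditionally now, even for CONFIG_KVM=n builds */
  #define POSTED_INTR_VECTOR              0xf2
  #define POSTED_INTR_WAKEUP_VECTOR       0xf1
  #define POSTED_INTR_NESTED_VECTOR       0xf0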

Cc: x86@kernel.org
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Xin Li (Intel) <xin@zytor.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-23 05:23:14 -05:00
Björn Töpel
5b98d210ac genirq/matrix: Dynamic bitmap allocation
A future user of the matrix allocator does not know the size of the matrix
bitmaps at compile time.

To avoid wasting memory on unnecessarily large bitmaps, size the bitmap at
matrix allocation time.
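
Conceptually (a sketch, not the exact patch):

  struct irq_matrix {
          unsigned int    matrix_bits;
          /* ... */
          /* flexible array replaces fixed IRQ_MATRIX_BITS-sized
           * bitmaps; sized when the matrix is allocated */
          unsigned long   scratch_map[];
  };

  unsigned int size = BITS_TO_LONGS(matrix_bits) * sizeof(unsigned long);

  m = kzalloc(sizeof(*m) + 2 * size, GFP_KERNEL); /* scratch + system map */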

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240222094006.1030709-11-apatel@ventanamicro.com
2024-02-23 10:18:44 +01:00
Sean Christopherson
5ef1d8c1dd KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
Do the cache flush of converted pages in svm_register_enc_region() before
dropping kvm->lock to fix use-after-free issues where region and/or its
array of pages could be freed by a different task, e.g. if userspace has
__unregister_enc_region_locked() already queued up for the region.

Note, the "obvious" alternative of using local variables doesn't fully
resolve the bug, as region->pages is also dynamically allocated.  I.e. the
region structure itself would be fine, but region->pages could be freed.

Flushing multiple pages under kvm->lock is unfortunate, but the entire
flow is a rare slow path, and the manual flush is only needed on CPUs that
lack coherency for encrypted memory.
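
The ordering that matters, sketched (simplified from
svm_register_enc_region()):

  mutex_lock(&kvm->lock);

  /* flush while kvm->lock prevents the region, and region->pages,
   * from being freed by __unregister_enc_region_locked() */
  sev_clflush_pages(region->pages, region->npages);

  list_add_tail(&region->list, &sev->regions_list);
  mutex_unlock(&kvm->lock);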

Fixes: 19a23da539 ("Fix unsynchronized access to sev members through svm_register_enc_region")
Reported-by: Gabe Kirkpatrick <gkirkpatrick@google.com>
Cc: Josh Eads <josheads@google.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20240217013430.2079561-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-23 03:55:59 -05:00
Sean Christopherson
a1176ef5c9 KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
Advertise and support software-protected VMs if and only if the TDP MMU is
enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's
legacy/shadow MMU.  TDP support for the shadow MMU is maintenance-only,
e.g. support for TDX and SNP will also be restricted to the TDP MMU.

Fixes: 89ea60c2c7 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 17:07:06 -08:00
Sean Christopherson
422692098c KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
Rewrite the help message for KVM_SW_PROTECTED_VM to make it clear that
software-protected VMs are a development and testing vehicle for
guest_memfd(), and that attempting to use KVM_SW_PROTECTED_VM for anything
remotely resembling a "real" VM will fail.  E.g. any memory accesses from
KVM will incorrectly access shared memory, nested TDP is wildly broken,
and so on and so forth.

Update KVM's API documentation with similar warnings to discourage anyone
from attempting to run anything but selftests with KVM_X86_SW_PROTECTED_VM.

Fixes: 89ea60c2c7 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 17:07:06 -08:00
Sean Christopherson
576a15de8d KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read
Free TDP MMU roots from vCPU context while holding mmu_lock for read; it
is completely legal to invoke kvm_tdp_mmu_put_root() as a reader.  This
eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path
after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated
roots, free obsolete roots, and allocate new roots in parallel.

On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap"
operation to run in parallel with freeing and allocating roots reduces the
worst case latency for a vCPU to reload a root from 2-3ms to <100us.
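
Sketch of the locking change in the root-freeing path (assumed shape):

  /* the TDP MMU can put roots as a reader; the shadow MMU still
   * needs mmu_lock held for write */
  if (tdp_mmu_enabled)
          read_lock(&kvm->mmu_lock);
  else
          write_lock(&kvm->mmu_lock);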

Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
dab285e4ec KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
Allocate TDP MMU roots while holding mmu_lock for read, and instead use
tdp_mmu_pages_lock to guard against duplicate roots.  This allows KVM to
create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to
yield, e.g. allows vCPUs to load new roots after memslot deletion without
forcing the zap thread to detect contention and yield (or complete if the
kernel isn't preemptible).

Note, creating a new TDP MMU root as an mmu_lock reader is safe for two
reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take
mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier
invalidations, and (2) paths that require all roots/SPTEs to *observe*
some given state without holding mmu_lock for write must ensure freshness
through some other means, e.g. toggling dirty logging must first wait for
SRCU readers to recognize the memslot flags change before processing
existing roots/SPTEs.

Link: https://lore.kernel.org/r/20240111020048.844847-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
f5238c2a60 KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
When allocating a new TDP MMU root, check for a usable root while holding
mmu_lock for read and only acquire mmu_lock for write if a new root needs
to be created.  There is no need to serialize other MMU operations if a
vCPU is simply grabbing a reference to an existing root, holding mmu_lock
for write is "necessary" (spoiler alert, it's not strictly necessary) only
to ensure KVM doesn't end up with duplicate roots.

Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
to setups that frequently delete memslots, i.e. which force all vCPUs to
reload all roots.
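
Roughly (helper names here are hypothetical):

  read_lock(&kvm->mmu_lock);
  root = tdp_mmu_try_get_existing_root(vcpu);   /* hypothetical */
  read_unlock(&kvm->mmu_lock);

  if (!root) {
          /* only creating a brand new root needs write protection */
          write_lock(&kvm->mmu_lock);
          root = tdp_mmu_alloc_root(vcpu);      /* hypothetical */
          write_unlock(&kvm->mmu_lock);
  }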

Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
d746182337 KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
When write-protecting SPTEs, don't process invalid roots as invalid roots
are unreachable, i.e. can't be used to access guest memory and thus don't
need to be write-protected.

Note, this is *almost* a nop for kvm_tdp_mmu_clear_dirty_pt_masked(),
which is called under slots_lock, i.e. is mutually exclusive with
kvm_mmu_zap_all_fast().  But it's possible for something other than the
"fast zap" thread to grab a reference to an invalid root and thus keep a
root alive (but completely empty) after kvm_mmu_zap_all_fast() completes.

The kvm_tdp_mmu_write_protect_gfn() case is more interesting as KVM write-
protects SPTEs for reasons other than dirty logging, e.g. if a KVM creates
a SPTE for a nested VM while a fast zap is in-progress.

Add another TDP MMU iterator to visit only valid roots, and
opportunistically convert kvm_tdp_mmu_get_vcpu_root_hpa() to said iterator.

Link: https://lore.kernel.org/r/20240111020048.844847-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
99b85fda91 KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
When zapping a GFN in response to an APICv or MTRR change, don't zap SPTEs
for invalid roots as KVM only needs to ensure the guest can't use stale
mappings for the GFN.  Unlike kvm_tdp_mmu_unmap_gfn_range(), which must
zap "unreachable" SPTEs to ensure KVM doesn't mark a page accessed/dirty,
kvm_tdp_mmu_zap_leafs() isn't used (and isn't intended to be used) to
handle freeing of host memory.

Link: https://lore.kernel.org/r/20240111020048.844847-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
6577f1efdf KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
Modify for_each_tdp_mmu_root() and __for_each_tdp_mmu_root_yield_safe() to
accept -1 for _as_id to mean "process all memslot address spaces".  That
way code that wants to process both SMM and !SMM doesn't need to iterate
over roots twice (and likely copy+paste code in the process).

Deliberately don't cast _as_id to an "int", just in case not casting helps
the compiler elide the "_as_id >=0" check when being passed an unsigned
value, e.g. from a memslot.

No functional change intended.
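
i.e. the iterator's filter becomes (sketch):

  /* _as_id < 0 means "all address spaces"; the >= 0 check is done
   * without casting so an unsigned as_id can still be elided */
  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)                   \
          list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)  \
                  if (_as_id >= 0 &&                                   \
                      kvm_mmu_page_as_id(_root) != _as_id) {           \
                  } else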

Link: https://lore.kernel.org/r/20240111020048.844847-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
fcdffe97f8 KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
Don't force a TLB flush when zapping SPTEs in invalid roots as vCPUs
can't be actively using invalid roots (zapping SPTEs in invalid roots is
necessary only to ensure KVM doesn't mark a page accessed/dirty after it
is freed by the primary MMU).

Link: https://lore.kernel.org/r/20240111020048.844847-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
8ca983631f KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
Zap invalidated TDP MMU roots at maximum granularity, i.e. with more
frequent conditional resched checkpoints, in order to avoid running for an
extended duration (milliseconds, or worse) without honoring a reschedule
request.  And for kernels running with full or real-time preempt models,
zapping at 4KiB granularity also provides significantly reduced latency
for other tasks that are contending for mmu_lock (which isn't necessarily
an overall win for KVM, but KVM should do its best to honor the kernel's
preemption model).

To keep KVM's assertion that zapping at 1GiB granularity is functionally
ok, which is the main reason 1GiB was selected in the past, skip straight
to zapping at 1GiB if KVM is configured to prove the MMU.  Zapping roots
is far more common than a vCPU replacing a 1GiB page table with a hugepage,
e.g. generally happens multiple times during boot, and so keeping the test
coverage provided by root zaps is desirable, just not for production.
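
The policy, sketched:

  /* zap at 4KiB granularity so reschedule requests are honored
   * promptly; keep 1GiB zapping only for MMU stress-testing */
  int zap_level = PG_LEVEL_4K;

  if (IS_ENABLED(CONFIG_KVM_PROVE_MMU))
          zap_level = PG_LEVEL_1G;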

Cc: David Matlack <dmatlack@google.com>
Cc: Pattara Teerapong <pteerapong@google.com>
Link: https://lore.kernel.org/r/20240111020048.844847-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
322d79f1db KVM: x86: Clean up directed yield API for "has pending interrupt"
Directly return the boolean result of whether or not a vCPU has a pending
interrupt instead of effectively doing:

  if (true)
	return true;

  return false;

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:40 -08:00
Sean Christopherson
9b8615c5d3 KVM: x86: Rely solely on preempted_in_kernel flag for directed yield
Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the
flag is "accurate" (or rather, consistent and deterministic within KVM)
for guests with protected state, and explicitly use preempted_in_kernel
when checking if a vCPU was preempted in kernel mode instead of bouncing
through kvm_arch_vcpu_in_kernel().

Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to
preempted_in_kernel if the target vCPU is not the "running", i.e. loaded,
vCPU, as the only reason that code existed was for the directed yield case
where KVM wants to check the CPL of a vCPU that may or may not be loaded
on the current pCPU.

Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:03 -08:00
Sean Christopherson
77bcd9e623 KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted
in-kernel.  Unlike literally every other architecture, x86's VMX can check
if a vCPU is in kernel context if and only if the vCPU is loaded on the
current pCPU.

x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying
kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel
as needed.  But that's unnecessary, confusing, and fragile, e.g. x86 has
had at least one bug where KVM incorrectly used a stale
preempted_in_kernel.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:26:26 -08:00
Sean Christopherson
fc3c94142b KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit()
WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard
against incorrect management of the key, e.g. to detect if KVM fails to
decrement the key in error paths.  Because kvm_has_noapic_vcpu is purely
an optimization, in all likelihood KVM could completely botch handling of
kvm_has_noapic_vcpu and no one would notice (which is a good argument for
deleting the key entirely, but that's a problem for another day).

Note, ideally the sanity check would be performed when kvm_usage_count
goes to zero, but adding an arch callback just for this sanity check isn't
at all worth doing.
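
Sketch of the exit-time assertion:

  static void __exit kvm_x86_exit(void)
  {
          /* all vCPUs are gone by now, so the key must be balanced */
          WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
  }
  module_exit(kvm_x86_exit);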

Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:26 -08:00
Sean Christopherson
a78d904669 KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code
Move incrementing and decrementing of kvm_has_noapic_vcpu into
kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug
where KVM fails to decrement the count if vCPU creation ultimately fails,
e.g. due to a memory allocation failing.

Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize
lapic_in_kernel() checks, and that optimization is quite dubious.  That,
and practically speaking no setup that cares at all about performance runs
with a userspace local APIC.

Reported-by: Li RongQing <lirongqing@baidu.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:09 -08:00
Sean Christopherson
0ec3d6d1f1 KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of
vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp
the VMX preemption to force a VM-Exit and let vendor code fully handle
forcing a VM-Exit.

Opportunistically drop __kvm_request_immediate_exit() and just have
vendor code call smp_send_reschedule() directly.  SVM already does this
when injecting an event while also trying to single-step an IRET, i.e.
it's not exactly secret knowledge that KVM uses a reschedule IPI to force
an exit.

Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:41 -08:00
Sean Christopherson
7b3d1bbf8d KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2
Eat VMX preemption timer exits in the fastpath regardless of whether L1 or
L2 is active.  The VM-Exit is 100% KVM-induced, i.e. there is nothing
directly related to the exit that KVM needs to do on behalf of the guest,
thus there is no reason to wait until the slow path to do nothing.

Opportunistically add comments explaining why preemption timer exits for
emulating the guest's APIC timer need to go down the slow path.

Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:37 -08:00
Sean Christopherson
bf1a49436e KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers
Let the fastpath code decide which exits can/can't be handled in the
fastpath when L2 is active, e.g. when KVM generates a VMX preemption
timer exit to forcefully regain control, there is no "work" to be done and
so such exits can be handled in the fastpath regardless of whether L1 or
L2 is active.

Moving the is_guest_mode() check into the fastpath code also makes it
easier to see that L2 isn't allowed to use the fastpath in most cases,
e.g. it's not immediately obvious why handle_fastpath_preemption_timer()
is called from the fastpath and the normal path.

Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
11776aa0cf KVM: VMX: Handle forced exit due to preemption timer in fastpath
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the
exit fastpath, i.e. avoid calling back into handle_preemption_timer() for
the same exit.  There is no work to be done for forced exits, as the name
suggests the goal is purely to get control back in KVM.

In addition to shaving a few cycles, this will allow cleanly separating
handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g.
it's not immediately obvious why _apparently_ calling
handle_fastpath_preemption_timer() twice on a "slow" exit is necessary:
the "slow" call is necessary to handle exits from L2, which are excluded
from the fastpath by vmx_vcpu_run().

Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
e6b5d16bbd KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits
Re-enter the guest in the fast path if the VMX preemption timer VM-Exit was
"spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by
some miracle the timer expired before any other VM-Exit occurred.  This is
just an intermediate step to cleaning up the preemption timer handling,
optimizing these types of spurious VM-Exits is not interesting as they are
extremely rare/infrequent.
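
Sketch of the fastpath check (simplified; treat details as approximate):

  static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
  {
          struct vcpu_vmx *vmx = to_vmx(vcpu);

          /* timer was "soft disabled" (-1u) yet still expired:
           * nothing to do, go straight back into the guest */
          if (vmx->loaded_vmcs->hv_timer_soft_disabled)
                  return EXIT_FASTPATH_REENTER_GUEST;

          /* ... */
          return EXIT_FASTPATH_NONE;
  }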

Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
9c9025ea00 KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is
forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to
inject an event but needs to first complete some other operation.
Knowing that KVM is (or isn't) forcing an exit is useful information when
debugging issues related to event injection.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
dfeef3d3f3 KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU.  Prior to commit 93c05d3ef2 ("KVM: x86: improve
reexecute_instruction"), the flag simply didn't exist (and KVM actually
returned "true" unconditionally for both types of MMUs).  I.e. the
explicit check for a direct MMU is simply leftover artifact from old code.

Link: https://lore.kernel.org/r/20240203002343.383056-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
515c18a64e KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop
the dedicated path entirely and always query indirect_shadow_pages when
deciding whether or not to try unprotecting the gfn.  For indirect, a.k.a.
shadow MMUs, checking indirect_shadow_pages is harmless; unless *every*
shadow page was somehow zapped while KVM was attempting to emulate the
instruction, indirect_shadow_pages is guaranteed to be non-zero.

Well, unless the instruction used a direct hugepage with 2-level paging
for its code page, but in that case, there's obviously nothing to
unprotect.  And in the extremely unlikely case all shadow pages were
zapped, there's again obviously nothing to unprotect.

Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Mingwei Zhang
474b99ed70 KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
Drop KVM's completely pointless acquisition of mmu_lock when deciding
whether or not to unprotect any shadow pages residing at the gfn before
resuming the guest to let it retry an instruction that KVM failed to
emulate.  In this case, indirect_shadow_pages is used as a coarse-grained
heuristic to check if there is any chance of there being a relevant shadow
page to unprotect.  But acquiring mmu_lock largely defeats any benefit
to the heuristic, as taking mmu_lock for write is likely far more costly
to the VM as a whole than unnecessarily walking mmu_page_hash.

Furthermore, the current code is already prone to false negatives and
false positives, as it drops mmu_lock before checking the flag and
unprotecting shadow pages.  And as evidenced by the lack of bug reports,
neither false positives nor false negatives are problematic.  A false
positive simply means that KVM will try to unprotect shadow pages that
have already been zapped.  And a false negative means that KVM will
resume the guest without unprotecting the gfn, i.e. if a shadow page was
_just_ created, the vCPU will hit the same page fault and do the whole
dance all over again, and detect and unprotect the shadow page the second
time around (or not, if something else zaps it first).
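
The check reduces to (sketch):

  /* lockless heuristic: both false positives and false negatives
   * are benign, so mmu_lock buys nothing here */
  if (!vcpu->kvm->arch.indirect_shadow_pages)
          return false;

  kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));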

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: drop READ_ONCE() and comment change, rewrite changelog]
Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
2a5f091ce1 KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7.  KVM
currently has a mix of open coded accesses and calls to kvm_get_dr(),
which is confusing and ugly because there's no rhyme or reason as to why
any particular chunk of code uses kvm_get_dr().

The obvious alternative is to force all accesses through kvm_get_dr(),
but it's not at all clear that doing so would be a net positive, e.g. even
if KVM ends up wanting/needing to force all reads through a common helper,
e.g. to play caching games, the cost of reverting this change is likely
lower than the ongoing cost of maintaining weird, arbitrary code.

No functional change intended.

Cc: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Sean Christopherson
fc5375dd8c KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.

No functional change intended.
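
i.e. (old vs. new shape, as a sketch):

  /* old: void kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val); */
  unsigned long kvm_get_dr(struct kvm_vcpu *vcpu, int dr);

  /* callers become straight-line reads */
  unsigned long dr6 = kvm_get_dr(vcpu, 6);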

Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Sean Christopherson
b1a3c366cb x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace
Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
enumerated via MSR, i.e. aren't accessible to userspace without help from
the kernel, and knowing whether or not 5-level EPT is supported is useful
for debug, triage, testing, etc.

For example, when EPT is enabled, bits 51:48 of guest physical addresses
are consumed by the CPU if and only if 5-level EPT is enabled.  For CPUs
with MAXPHYADDR > 48, KVM *can't* map all legal guest memory without
5-level EPT, making 5-level EPT support valuable information for userspace.
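
The detection amounts to roughly the following (a sketch; the feature-flag
plumbing is simplified and the helper name is an assumption):

  u64 ept_vpid;

  rdmsrl(MSR_IA32_VMX_EPT_VPID_CAP, ept_vpid);
  if (ept_vpid & VMX_EPT_PAGE_WALK_5_BIT)
          set_vmx_feature(c, EPT_5LEVEL);       /* hypothetical helper */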

Reported-by: Yi Lai <yi1.lai@intel.com>
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20240110002340.485595-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:03:56 -08:00
Nathan Chancellor
22d3da073f x86: drop stack-alignment plugin opt
Now that the minimum supported version of LLVM for building the kernel has
been bumped to 13.0.1, the inner ifeq statement is always false, as the
build will fail during the configuration stage for older LLVM versions.

This effectively reverts part of commit b33fff07e3 ("x86, build: allow
LTO to be selected") and its follow up fix, commit 2398ce8015 ("x86,
lto: Pass -stack-alignment only on LLD < 13.0.0").

Link: https://lkml.kernel.org/r/20240125-bump-min-llvm-ver-to-13-0-1-v1-3-f5ff9bda41c5@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:38:54 -08:00
Nathan Chancellor
2947a4567f treewide: update LLVM Bugzilla links
LLVM moved their issue tracker from their own Bugzilla instance to GitHub
issues.  While all of the links are still valid, they may not necessarily
show the most up to date information around the issues, as all updates
will occur on GitHub, not Bugzilla.

Another complication is that the Bugzilla issue number is not always the
same as the GitHub issue number.  Thankfully, LLVM maintains this mapping
through two shortlinks:

  https://llvm.org/bz<num> -> https://bugs.llvm.org/show_bug.cgi?id=<num>
  https://llvm.org/pr<num> -> https://github.com/llvm/llvm-project/issues/<mapped_num>

Switch all "https://bugs.llvm.org/show_bug.cgi?id=<num>" links to the
"https://llvm.org/pr<num>" shortlink so that the links show the most up to
date information.  Each migrated issue links back to the Bugzilla entry,
so there should be no loss of fidelity of information here.

Link: https://lkml.kernel.org/r/20240109-update-llvm-links-v1-3-eb09b59db071@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Fangrui Song <maskray@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mykola Lysenko <mykolal@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:38:51 -08:00
Jakub Kicinski
fecc51559a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/udp.c
  f796feabb9 ("udp: add local "peek offset enabled" flag")
  56667da739 ("net: implement lockless setsockopt(SO_PEEK_OFF)")

Adjacent changes:

net/unix/garbage.c
  aa82ac51d6 ("af_unix: Drop oob_skb ref before purging queue in GC.")
  11498715f2 ("af_unix: Remove io_uring code for GC.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22 15:29:26 -08:00
Ryan Roberts
506b586769 x86/mm: convert pte_next_pfn() to pte_advance_pfn()
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.
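
For illustration, the x86 override amounts to simple PFN arithmetic on the
PTE value. A minimal sketch, assuming PFN_PTE_SHIFT is defined for the
architecture:

  static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
          /* Advance the PFN encoded in the PTE by 'nr' pages. */
          return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
  }
  #define pte_advance_pfn pte_advance_pfn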

Link: https://lkml.kernel.org/r/20240215103205.2607016-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Chris Koch
43b1d3e68e kexec: Allocate kernel above bzImage's pref_address
A relocatable kernel will relocate itself to pref_address if it is
loaded below pref_address. This means a booted kernel may be relocating
itself to an area with reserved memory on modern systems, potentially
clobbering arbitrary data that may be important to the system.

This is often the case, as the default value of PHYSICAL_START is
0x1000000 and kernels are typically loaded at 0x100000 or above by
bootloaders like iPXE or kexec. GRUB behaves like the approach
implemented here.

Also fix the documentation around pref_address and PHYSICAL_START so that
it is accurate.
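
A sketch of the allocation constraint this implies for the kexec bzImage
loader ('header' being the parsed setup header; treat this as illustrative
rather than the exact hunk):

  /* Never ask for memory below the image's preferred address, so a
   * relocatable kernel has no reason to relocate itself into
   * possibly-reserved memory.
   */
  kbuf.buf_min = max_t(unsigned long, header->pref_address,
                       MIN_KERNEL_LOAD_ADDR);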

[ dhansen: changelog tweak ]

Co-developed-by: Cloud Hsu <cloudhsu@google.com>
Signed-off-by: Cloud Hsu <cloudhsu@google.com>
Signed-off-by: Chris Koch <chrisko@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/all/20231215190521.3796022-1-chrisko%40google.com
2024-02-22 15:13:57 -08:00
Kai Huang
5bdd181821 x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
Commit e56d28df2f ("x86/virt/tdx: Configure global KeyID on all
packages") causes a sparse warning:

  arch/x86/virt/vmx/tdx/tdx.c:683:27: warning: incorrect type in argument 1 (different address spaces)
  arch/x86/virt/vmx/tdx/tdx.c:683:27:    expected void [noderef] __iomem *dst
  arch/x86/virt/vmx/tdx/tdx.c:683:27:    got void *

The reason is that TDX must use the MOVDIR64B instruction to convert TDX
private memory (which is normal RAM but not MMIO) back to normal.  The
TDX code uses the existing movdir64b() helper to do that, but the first
argument @dst of movdir64b() is annotated with __iomem.

When movdir64b() was first introduced in commit 0888e1030d
("x86/asm: Carve out a generic movdir64b() helper for general usage"),
it didn't have the __iomem annotation.  But this commit also introduced
the same "incorrect type" sparse warning because iosubmit_cmds512(),
which was the sole caller of movdir64b(), has the __iomem annotation.

This was later fixed by commit 6ae58d8713 ("x86/asm: Annotate
movdir64b()'s dst argument with __iomem").  That fix was reasonable
because until TDX code the movdir64b() was only used to move data to
MMIO location, as described by the commit message:

  ... The current usages send a 64-bytes command descriptor to an MMIO
  location (portal) on a device for consumption. When future usages for
  the MOVDIR64B instruction warrant a separate variant of a memory to
  memory operation, the argument annotation can be revisited.

Now TDX code uses MOVDIR64B to move data to normal memory so it's time
to revisit.

The SDM says the destination of MOVDIR64B is "memory location specified
in a general register", thus it's more reasonable that movdir64b() does
not have the __iomem annotation on the @dst.

Remove the __iomem annotation from the @dst argument of movdir64b() to
fix the sparse warning in TDX code.  Similar to memset_io(), introduce a
new movdir64b_io() to cover the case where the destination is an MMIO
location, and change the sole caller iosubmit_cmds512() to use the new
movdir64b_io().

In movdir64b_io() explicitly use __force in the type casting otherwise
there will be below sparse warning:

  warning: cast removes address space '__iomem' of expression

[ dhansen: normal changelog tweaks ]
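
Based on the description above, the new wrapper plausibly looks like this
(a sketch, not the exact hunk):

  static inline void movdir64b_io(void __iomem *dst, const void *src)
  {
          /* __force drops the __iomem address space, as explained above. */
          movdir64b((void __force *)dst, src);
  }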

Closes: https://lore.kernel.org/oe-kbuild-all/202312311924.tGjsBIQD-lkp@intel.com/
Fixes: e56d28df2f ("x86/virt/tdx: Configure global KeyID on all packages")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/all/20240126023852.11065-1-kai.huang%40intel.com
2024-02-22 14:52:09 -08:00
Rick Edgecombe
82ace18501 x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.
In terms of security, the problematic case is guest PTEs mapping the
shared alias GFNs, since the VMM has control of the shared mapping in the
EPT/NPT.

Such conversion errors may herald future system instability, but are
temporarily survivable with proper handling in the caller. The kernel
traditionally makes every effort to keep running, but it is expected that
some coco guests may prefer to play it safe security-wise, and panic in
this case. To accommodate both cases, warn when the arch breakouts for
converting memory at the VMM layer return an error to CPA. Security-focused
users can rely on panic_on_warn to defend against bugs in the callers. Some
VMMs are not known to behave in this troublesome way, so users who would
like to terminate on any unusual VMM behavior in this area are covered as
well.

Since the arch breakouts host the logic for handling coco implementation
specific errors, an error returned from them means that the set_memory()
call is out of options for handling the error internally. Make this the
condition to warn about.

It is possible that very rarely these functions could fail due to guest
memory pressure (in the case of failing to allocate a huge page when
splitting a page table). Don't warn in this case because it is a lot less
likely to indicate an attack by the host and it is not clear which
set_memory() calls should get the same treatment. That corner should be
addressed by future work that considers the more general problem and not
just papers over a single set_memory() variant.
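
For callers, the pattern implied here is to treat a conversion failure as
the pages being in an indeterminate (possibly shared) state and leak them
rather than free them. A hedged caller-side sketch:

  if (set_memory_decrypted((unsigned long)vaddr, npages)) {
          /* The pages may still be shared with the VMM: do not return
           * them to the page allocator, leak them instead.
           */
          return -EIO;
  }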

Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/all/20240122184003.129104-1-rick.p.edgecombe%40intel.com
2024-02-22 14:25:41 -08:00
Christophe Leroy
6cdc82db0c mm: ptdump: have ptdump_check_wx() return bool
Have ptdump_check_wx() return true when the check is successful or false
otherwise.

[akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)]
Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Phong Tran <tranmanphong@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Price <steven.price@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:47 -08:00
Christophe Leroy
a5e8131a03 arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WX
All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(), which
is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
no-op otherwise.

Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
debug_checkwx() immediately after calling mark_rodata_ro() instead of
calling it at the end of every mark_rodata_ro().

On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
before calling debug_checkwx().  Now the check is inside the callee
ptdump_walk_pgd_level_checkwx().

On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature.  The check
is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().
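
A minimal sketch of the centralized wrapper described above (the exact
body in linux/ptdump.h may differ):

  #ifdef CONFIG_DEBUG_WX
  static inline void debug_checkwx(void)
  {
          ptdump_check_wx();
  }
  #else
  static inline void debug_checkwx(void) { }
  #endif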

Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Phong Tran <tranmanphong@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Price <steven.price@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:47 -08:00
Yosry Ahmed
3cfd6625a6 x86/mm: clarify "prev" usage in switch_mm_irqs_off()
In the x86 implementation of switch_mm_irqs_off(), we do not use the
"prev" argument passed in by the caller; we exclusively use
"real_prev", which is cpu_tlbstate.loaded_mm.  This is not obvious at
first sight.

Furthermore, a comment describes a condition that happens when called with
prev == next, but this should not affect the function in any way since
prev is unused.  Apparently, the comment is intended to clarify why we
don't rely on prev == next to decide whether we need to update CR3, but
again, it is not obvious.  The comment also references the fact that
leave_mm() calls with prev == NULL and tsk == NULL, but this also
shouldn't matter because prev is unused and tsk is only used in one
function which has a NULL check.

Clarify things by renaming (prev -> unused) and (real_prev -> prev), also
move and rewrite the comment as an explanation for why we don't rely on
"prev" supplied by the caller in x86 code and use our own.  Hopefully this
makes reading the code easier.
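
After the rename, the function shape is roughly the following (a sketch;
the real body does much more):

  void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
                          struct task_struct *tsk)
  {
          /* x86 ignores the caller-supplied previous mm and trusts its
           * own per-CPU bookkeeping instead:
           */
          struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);

          /* ... CR3 switch and TLB handling elided ... */
  }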

Link: https://lkml.kernel.org/r/20240126080644.1714297-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:42 -08:00
Yosry Ahmed
7dbbc8f57d x86/mm: delete unused cpu argument to leave_mm()
The argument is unused since commit 3d28ebceaf ("x86/mm: Rework lazy
TLB to track the actual loaded mm"), delete it.

Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:41 -08:00
Linus Torvalds
6714ebb922 Including fixes from bpf and netfilter.
Current release - regressions:
 
   - af_unix: fix another unix GC hangup
 
 Previous releases - regressions:
 
   - core: fix a possible AF_UNIX deadlock
 
   - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready()
 
   - netfilter: nft_flow_offload: release dst in case direct xmit path is used
 
   - bridge: switchdev: ensure MDB events are delivered exactly once
 
   - l2tp: pass correct message length to ip6_append_data
 
   - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_established()
 
   - tls: fixes for record type handling with PEEK
 
   - devlink: fix possible use-after-free and memory leaks in devlink_init()
 
 Previous releases - always broken:
 
   - bpf: fix an oops when attempting to read the vsyscall
   	 page through bpf_probe_read_kernel
 
   - sched: act_mirred: use the backlog for mirred ingress
 
   - netfilter: nft_flow_offload: fix dst refcount underflow
 
   - ipv6: sr: fix possible use-after-free and null-ptr-deref
 
   - mptcp: fix several data races
 
   - phonet: take correct lock to peek at the RX queue
 
 Misc:
 
   - handful of fixes and reliability improvements for selftests
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmXXKMMSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkmgAQAIV2NAVEvHVBtnm0Df9PuCcHQx6i9veS
 tGxOZMVwb5ePFI+dpiNyyn61koEiRuFLOm66pfJAuT5j5z6m4PEFfPZgtiVpCHVK
 4sz4UD4+jVLmYijv+YlWkPU3RWR0RejSkDbXwY5Y9Io/DWHhA2iq5IyMy2MncUPY
 dUc12ddEsYRH60Kmm2/96FcdbHw9Y64mDC8tIeIlCAQfng4U98EXJbCq9WXsPPlW
 vjwSKwRG76QGDugss9XkatQ7Bsva1qTobFGDOvBMQpMt+dr81pTGVi0c1h/drzvI
 EJaDO8jJU3Xy0pQ80beboCJ1KlVCYhWSmwlBMZUA1f0lA2m3U5UFEtHA5hHKs3Mi
 jNe/sgKXzThrro0fishAXbzrro2QDhCG3Vm4PRlOGexIyy+n0gIp1lHwEY1p2vX9
 RJPdt1e3xt/5NYRv6l2GVQYFi8Wd0endgzCdJeXk0OWQFLFtnxhG6ejpgxtgN0fp
 CzKU6orFpsddQtcEOdIzKMUA3CXYWAdQPXOE5Ptjoz3MXZsQqtMm3vN4and8jJ19
 8/VLsCNPp11bSRTmNY3Xt85e+gjIA2mRwgRo+ieL6b1x2AqNeVizlr6IZWYQ4TdG
 rUdlEX0IVmov80TSeQoWgtzTO7xMER+qN6FxAs3pQoUFjtol3pEURq9FQ2QZ8jW4
 5rKpNBrjKxdk
 =eUOc
 -----END PGP SIGNATURE-----

Merge tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - af_unix: fix another unix GC hangup

  Previous releases - regressions:

   - core: fix a possible AF_UNIX deadlock

   - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready()

   - netfilter: nft_flow_offload: release dst in case direct xmit path
     is used

   - bridge: switchdev: ensure MDB events are delivered exactly once

   - l2tp: pass correct message length to ip6_append_data

   - dccp/tcp: unhash sk from ehash for tb2 alloc failure after
     check_established()

   - tls: fixes for record type handling with PEEK

   - devlink: fix possible use-after-free and memory leaks in
     devlink_init()

  Previous releases - always broken:

   - bpf: fix an oops when attempting to read the vsyscall page through
     bpf_probe_read_kernel

   - sched: act_mirred: use the backlog for mirred ingress

   - netfilter: nft_flow_offload: fix dst refcount underflow

   - ipv6: sr: fix possible use-after-free and null-ptr-deref

   - mptcp: fix several data races

   - phonet: take correct lock to peek at the RX queue

  Misc:

   - handful of fixes and reliability improvements for selftests"

* tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
  l2tp: pass correct message length to ip6_append_data
  net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY
  selftests: ioam: refactoring to align with the fix
  Fix write to cloned skb in ipv6_hop_ioam()
  phonet/pep: fix racy skb_queue_empty() use
  phonet: take correct lock to peek at the RX queue
  net: sparx5: Add spinlock for frame transmission from CPU
  net/sched: flower: Add lock protection when remove filter handle
  devlink: fix port dump cmd type
  net: stmmac: Fix EST offset for dwmac 5.10
  tools: ynl: don't leak mcast_groups on init error
  tools: ynl: make sure we always pass yarg to mnl_cb_run
  net: mctp: put sock on tag allocation failure
  netfilter: nf_tables: use kzalloc for hook allocation
  netfilter: nf_tables: register hooks last when adding new chain/flowtable
  netfilter: nft_flow_offload: release dst in case direct xmit path is used
  netfilter: nft_flow_offload: reset dst in route object after setting up flow
  netfilter: nf_tables: set dormant flag on hook register failure
  selftests: tls: add test for peeking past a record of a different type
  selftests: tls: add test for merging of same-type control messages
  ...
2024-02-22 09:57:58 -08:00
James Morse
c0d848fcb0 x86/resctrl: Remove lockdep annotation that triggers false positive
get_domain_from_cpu() walks a list of domains to find the one that
contains the specified CPU. This needs to be protected against races
with CPU hotplug when the list is modified. It has recently gained
a lockdep annotation to check this.

The lockdep annotation causes false positives when called via IPI as the
lock is held, but by another process. Remove it.

  [ bp: Refresh it ontop of x86/cache. ]

Fixes: fb700810d3 ("x86/resctrl: Separate arch and fs resctrl locks")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/ZdUSwOM9UUNpw84Y@agluck-desk3
2024-02-22 16:15:38 +01:00
Paul Durrant
003d914220 KVM: x86/xen: allow vcpu_info content to be 'safely' copied
If the guest sets an explicit vcpu_info GPA then, for any of the first 32
vCPUs, the content of the default vcpu_info in the shared_info page must be
copied into the new location. Because this copy may race with event
delivery (which updates the 'evtchn_pending_sel' field in vcpu_info),
event delivery needs to be deferred until the copy is complete.

Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen
that is used in atomic context if the vcpu_info PFN cache has been
invalidated, so that the update of vcpu_info can be deferred until the
cache can be refreshed (on the vCPU thread's way back into guest context).

Use this shadow if the vcpu_info cache has been *deactivated*, so that
the VMM can safely copy the vcpu_info content and then re-activate the
cache with the new GPA. To do this, stop considering an inactive vcpu_info
cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing
kvm_gpc_check() fail and kick the vCPU (if necessary).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org
[sean: add a bit of verbosity to the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:21 -08:00
Paul Durrant
615451d8cb KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
Now that all relevant kernel changes and selftests are in place, enable the
new capability.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:19 -08:00
Paul Durrant
3991f35805 KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA
If the guest does not explicitly set the GPA of vcpu_info structure in
memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded
in the shared_info page may be used. As described in a previous commit,
the shared_info page is an overlay at a fixed HVA within the VMM, so in
this case it also more optimal to activate the vcpu_info cache with a
fixed HVA to avoid unnecessary invalidation if the guest memory layout
is modified.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org
[sean: use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:17 -08:00
Paul Durrant
b9220d3279 KVM: x86/xen: allow shared_info to be mapped by fixed HVA
The shared_info page is not guest memory as such. It is a dedicated page
allocated by the VMM and overlaid onto guest memory in a GFN chosen by the
guest and specified in the XENMEM_add_to_physmap hypercall. The guest may
even request that shared_info be moved from one GFN to another by
re-issuing that hypercall, but the HVA is never going to change.

Because the shared_info page is an overlay the memory slots need to be
updated in response to the hypercall. However, memory slot adjustment is
not atomic and, whilst all vCPUs are paused, there is still the possibility
that events may be delivered (which requires the shared_info page to be
updated) whilst the shared_info GPA is absent. The HVA is never absent
though, so it makes much more sense to use that as the basis for the
kernel's mapping.

Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this
purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertise its
availability. Don't actually advertise it yet though. That will be done in
a subsequent patch, which will also add tests for the new attribute type.

Also update the KVM API documentation with the new attribute and also fix
it up to consistently refer to 'shared_info' (with the underscore).
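
From the VMM side, usage of the new attribute could look roughly like this
(the union member name is an assumption; the KVM API documentation is
authoritative):

  struct kvm_xen_hvm_attr attr = {
          .type = KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA,
          .u.shared_info.hva = (uint64_t)shinfo,  /* VMM-allocated page */
  };

  if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr))
          err(1, "KVM_XEN_HVM_SET_ATTR");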

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org
[sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:00:52 -08:00
Nikolay Borisov
07a5d4bcbf x86/insn: Directly assign x86_64 state in insn_init()
No point in checking again as this was already done by the caller.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240222111636.2214523-3-nik.borisov@suse.com
2024-02-22 12:23:27 +01:00
Nikolay Borisov
427e1646f1 x86/insn: Remove superfluous checks from instruction decoding routines
It's pointless checking if a particular part of an instruction is
decoded before calling the routine responsible for decoding it as this
check is duplicated in the routines themselves. Streamline the code by
removing the superfluous checks. No functional difference.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240222111636.2214523-2-nik.borisov@suse.com
2024-02-22 12:23:04 +01:00
Ingo Molnar
b7bcffe752 x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
The fresh changes to the vDSO Makefile in:

  289d0a475c ("x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32")
  329b77b59f ("x86/vdso: Simplify obj-y addition")

Conflicted with a pending change in:

  b388e57d46 ("x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o")

Which was resolved in a simple fasion in this merge commit:

  f14df823a6 ("Merge branch 'x86/vdso' into x86/core, to resolve conflict and to prepare for dependent changes")

... but all these changes make me look and notice a bit of historic baggage
left in the Makefile:

  - Disordered build rules, where non-standard build attributes were
    placed sometimes several lines after - and sometimes *before* - the
    .o build rules of the object files they relate to... Functional but
    inconsistent.

  - Inconsistent vertical spacing, stray whitespaces, inconsistent spelling
    of 'vDSO' over the years, a few spelling mistakes and inconsistent
    capitalization of comment blocks.

Tidy it all up. No functional changes intended.

Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-22 10:40:20 +01:00
Ingo Molnar
f14df823a6 Merge branch 'x86/vdso' into x86/core, to resolve conflict and to prepare for dependent changes
Conflicts:
	arch/x86/entry/vdso/Makefile

We also want to change arch/x86/entry/vdso/Makefile in a followup
commit, so merge the trees for this.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-22 10:16:53 +01:00
Paolo Abeni
fdcd4467ba bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZdaBCwAKCRDbK58LschI
 g3EhAP0d+S18mNabiEGz8efnE2yz3XcFchJgjiRS8WjOv75GvQEA6/sWncFjbc8k
 EqxPHmeJa19rWhQlFrmlyNQfLYGe4gY=
 =VkOs
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-02-22

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 24 day(s) which contain
a total of 15 files changed, 217 insertions(+), 17 deletions(-).

The main changes are:

1) Fix a syzkaller-triggered oops when attempting to read the vsyscall
   page through bpf_probe_read_kernel and friends, from Hou Tao.

2) Fix a kernel panic due to uninitialized iter position pointer in
   bpf_iter_task, from Yafang Shao.

3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel,
   from Martin KaFai Lau.

4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET)
   due to incorrect truesize accounting, from Sebastian Andrzej Siewior.

5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready,
   from Shigeru Yoshida.

6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be
   resolved, from Hari Bathini.

bpf-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready()
  selftests/bpf: Add negtive test cases for task iter
  bpf: Fix an issue due to uninitialized bpf_iter_task
  selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel
  bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel
  selftest/bpf: Test the read of vsyscall page under x86-64
  x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
  bpf, scripts: Correct GPL license name
  xsk: Add truesize to skb_add_rx_frag().
  bpf: Fix warning for bpf_cpumask in verifier
====================

Link: https://lore.kernel.org/r/20240221231826.1404-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22 10:04:47 +01:00
Kunwu Chan
e37ae6433a x86/apm_32: Remove dead function apm_get_battery_status()
This part was commented out 25 years ago in:

  commit d43c43b46ebfdb437b78206fcc1992c4d2e8c15e
  Author: linus1 <torvalds@linuxfoundation.org>
  Date:   Tue Sep 7 11:00:00 1999 -0600

      Import 2.3.26pre1

and no one seems to know why. It was probably unused even then.

Just remove it.

  [ bp: Expand commit message. ]

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126030824.579711-1-chentao@kylinos.cn
2024-02-21 19:38:03 +01:00
Dan Williams
40de53fd00 Merge branch 'for-6.8/cxl-cper' into for-6.8/cxl
Pick up CXL CPER notification removal for v6.8-rc6, to return in a later
merge window.
2024-02-20 22:57:35 -08:00
Paul Durrant
18b99e4d6d KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set
If the shared_info PFN cache has already been initialized then the content
of the shared_info page needs to be re-initialized whenever the guest
mode is (re)set.
Setting the guest mode is either done explicitly by the VMM via the
KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes
the MSR to set up the hypercall page.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:48 -08:00
Paul Durrant
c01c55a34f KVM: x86/xen: separate initialization of shared_info cache and content
A subsequent patch will allow shared_info to be initialized using either a
GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate
the initialization of the shared_info content from the activation of the
pfncache.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:47 -08:00
Paul Durrant
a4bff3df51 KVM: pfncache: remove KVM_GUEST_USES_PFN usage
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any
callers of kvm_gpc_init(), and for good reason: the implementation is
incomplete/broken.  And it's not clear that there will ever be a user of
KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is
non-trivial.

Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping
KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid
having to reason about broken code as __kvm_gpc_refresh() evolves.

Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage
check in hva_to_pfn_retry() and hence the 'usage' argument to
kvm_gpc_init() are also redundant.

[1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org
[sean: explicitly call out that guest usage is incomplete]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:43 -08:00
Paul Durrant
78b74638eb KVM: pfncache: add a mark-dirty helper
At the moment pages are marked dirty by open-coded calls to
mark_page_dirty_in_slot(), directly dereferencing the gpa and memslot
from the cache. After a subsequent patch these may not always be set,
so add a helper now so that callers are protected from needing to know
about this detail.
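
A plausible shape for such a helper (the name is illustrative):

  /* Mark the page backing the pfncache dirty, hiding the gpa/memslot
   * details from callers.
   */
  void kvm_gpc_mark_dirty(struct gfn_to_pfn_cache *gpc)
  {
          lockdep_assert_held(&gpc->lock);
          mark_page_dirty_in_slot(gpc->kvm, gpc->memslot,
                                  gpa_to_gfn(gpc->gpa));
  }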

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org
[sean: decrease indentation, use gpa_to_gfn()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:42 -08:00
Paul Durrant
4438355ec6 KVM: x86/xen: mark guest pages dirty with the pfncache lock held
Sampling gpa and memslot from an unlocked pfncache may yield inconsistent
values so, since there is no problem with calling mark_page_dirty_in_slot()
with the pfncache lock held, relocate the calls in
kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events()
accordingly.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:41 -08:00
Kirill A. Shutemov
ffc92cf3db x86/pat: Simplify the PAT programming protocol
The programming protocol for the PAT MSR follows the MTRR programming
protocol. However, this protocol is cumbersome and requires disabling
caching (CR0.CD=1), which is not possible on some platforms.

Specifically, a TDX guest is not allowed to set CR0.CD. It triggers
a #VE exception.

It turns out that the requirement to follow the MTRR programming
protocol for PAT programming is unnecessarily strict. The new Intel
Software Developer Manual (http://www.intel.com/sdm) (December 2023)
relaxes this requirement; please refer to the section titled
"Programming the PAT" for more information.

In short, this section provides an alternative PAT update sequence which
doesn't need to disable caches around the PAT update but only to flush
those caches and TLBs.

The AMD documentation does not link PAT programming to MTRR and is
therefore fine too.

The kernel only needs to flush the TLB after updating the PAT MSR. The
set_memory code already takes care of flushing the TLB and cache when
changing the memory type of a page.
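
Under the relaxed protocol the update sequence can be as simple as the
following sketch (assuming it runs on every CPU with interrupts disabled):

  static void pat_update(u64 pat_msr_val)
  {
          wrmsrl(MSR_IA32_CR_PAT, pat_msr_val);
          /* Per the SDM section cited above, flushing the TLB after the
           * write suffices; no CR0.CD dance is required.
           */
          flush_tlb_local();
  }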

  [ bp: Expand commit message. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240124130650.496056-1-kirill.shutemov@linux.intel.com
2024-02-20 14:40:51 +01:00
Josh Poimboeuf
b388e57d46 x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
For CONFIG_RETHUNK kernels, objtool annotates all the function return
sites so they can be patched during boot.  By design, after
apply_returns() is called, all tail-calls to the compiler-generated
default return thunk (__x86_return_thunk) should be patched out and
replaced with whatever's needed for any mitigations (or lack thereof).

The commit

  4461438a84 ("x86/retpoline: Ensure default return thunk isn't used at runtime")

adds a runtime check and a WARN_ONCE() if the default return thunk ever
gets executed after alternatives have been applied.  This warning is
a sanity check to make sure objtool and apply_returns() are doing their
job.

As Nathan reported, that check found something:

  Unpatched return thunk in use. This should not happen!
  WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:2856 __warn_thunk+0x27/0x40
  RIP: 0010:__warn_thunk+0x27/0x40
  Call Trace:
   <TASK>
   ? show_regs
   ? __warn
   ? __warn_thunk
   ? report_bug
   ? console_unlock
   ? handle_bug
   ? exc_invalid_op
   ? asm_exc_invalid_op
   ? ia32_binfmt_init
   ? __warn_thunk
   warn_thunk_thunk
   do_one_initcall
   kernel_init_freeable
   ? __pfx_kernel_init
   kernel_init
   ret_from_fork
   ? __pfx_kernel_init
   ret_from_fork_asm
   </TASK>

Boris debugged it and found that the unpatched return site was in
init_vdso_image_64(), whose translation unit wasn't being analyzed by
objtool, so it never got annotated and was therefore ignored by
apply_returns().

This is only a minor issue, as this function is only called during boot.
Still, objtool needs full visibility to the kernel.  Fix it by enabling
objtool on vdso-image-{32,64}.o.

Note this problem can only be seen with !CONFIG_X86_KERNEL_IBT, as that
requires objtool to run individually on all translation units rather than
on vmlinux.o.

  [ bp: Massage commit message. ]

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240215032049.GA3944823@dev-arch.thelio-3990X
2024-02-20 13:26:10 +01:00
Masahiro Yamada
cd14b01846 treewide: replace or remove redundant def_bool in Kconfig files
'def_bool X' is a shorthand for 'bool' plus 'default X'.

'def_bool' is redundant where 'bool' is already present, so 'def_bool X'
can be replaced with 'default X', or removed if X is 'n'.
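
As a quick illustration (hypothetical option name), the redundant case
turns into:

   config FOO
           bool
  -        def_bool y
  +        default y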

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-20 20:47:45 +09:00
Pawan Gupta
43fb862de8 KVM/VMX: Move VERW closer to VMentry for MDS mitigation
During VMentry VERW is executed to mitigate MDS. After VERW, any memory
access like register push onto stack may put host data in MDS affected
CPU buffers. A guest can then use MDS to sample host data.

Although the likelihood of secrets surviving in registers at the current
VERW callsite is low, it can't be ruled out. Harden the MDS mitigation
by moving the VERW mitigation late in the VMentry path.

Note that VERW for MMIO Stale Data mitigation is unchanged because of
the complexity of per-guest conditional VERW which is not easy to handle
that late in asm with no GPRs available. If the CPU is also affected by
MDS, VERW is unconditionally executed late in asm regardless of guest
having MMIO access.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:59 -08:00
Sean Christopherson
706a189dcf KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH.  Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:54 -08:00
Pawan Gupta
6613d82e61 x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
The VERW mitigation at exit-to-user is enabled via a static branch
mds_user_clear. This static branch is never toggled after boot, and can
be safely replaced with an ALTERNATIVE() which is convenient to use in
asm.

Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user
path. Also remove the now redundant VERW in exc_nmi() and
arch_exit_to_user_mode().

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:49 -08:00
Pawan Gupta
a0e2dab44d x86/entry_32: Add VERW just before userspace transition
As done for entry_64, add support for executing VERW late in the
exit-to-user path for 32-bit mode.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-3-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:46 -08:00
Pawan Gupta
3c7501722e x86/entry_64: Add VERW just before userspace transition
The mitigation for MDS is to use the VERW instruction to clear any secrets
in CPU buffers. Any memory accesses after VERW execution can still remain
in CPU buffers. It is safer to execute VERW late in the return-to-user path
to minimize the window in which kernel data can end up in CPU buffers.
There are not many kernel secrets to be had after SWITCH_TO_USER_CR3.

Add support for deploying VERW mitigation after user register state is
restored. This helps minimize the chances of kernel data ending up into
CPU buffers after executing VERW.

Note that the mitigation at the new location is not yet enabled.

  Corner case not handled
  =======================
  Interrupts returning to the kernel don't clear CPU buffers since the
  exit-to-user path is expected to do that anyway. But there could be
  a case when an NMI is generated in the kernel after the exit-to-user path
  has cleared the buffers. This case is not handled, and NMIs returning to
  the kernel don't clear CPU buffers, because:

  1. It is rare to get an NMI after VERW, but before returning to userspace.
  2. For an unprivileged user, there is no known way to make that NMI
     less rare or target it.
  3. It would take a large number of these precisely-timed NMIs to mount
     an actual attack.  There's presumably not enough bandwidth.
  4. The NMI in question occurs after a VERW, i.e. when user state is
     restored and most interesting data is already scrubbed. What's left
     is only the data that the NMI touches, and that may or may not be of
     any interest.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-2-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:42 -08:00
Pawan Gupta
baf8361e54 x86/bugs: Add asm helpers for executing VERW
MDS mitigation requires clearing the CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. The current
location of VERW leaves a possibility of kernel data ending up in CPU
buffers via memory accesses done after VERW, such as:

  1. Kernel data accessed by an NMI between VERW and return-to-user can
     remain in CPU buffers since NMI returning to kernel does not
     execute VERW to clear CPU buffers.
  2. Alyssa reported that after VERW is executed,
     CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
     call. Memory accesses during stack scrubbing can move kernel stack
     contents into CPU buffers.
  3. When caller-saved registers are restored after a return from a
     function executing VERW, the kernel stack accesses can remain in
     CPU buffers (since they occur after VERW).

To fix this VERW needs to be moved very late in exit-to-user path.

In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make VERW patching depend on a new feature
flag X86_FEATURE_CLEAR_CPU_BUF.
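
A 64-bit flavored sketch of such a macro (illustrative; mds_verw_sel
stands for a VERW-friendly selector operand added alongside it):

  .macro CLEAR_CPU_BUFFERS
          /* Only patched in when X86_FEATURE_CLEAR_CPU_BUF is set: */
          ALTERNATIVE "", "verw mds_verw_sel(%rip)", X86_FEATURE_CLEAR_CPU_BUF
  .endm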

Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-1-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:33 -08:00
James Morse
fb700810d3 x86/resctrl: Separate arch and fs resctrl locks
resctrl has one mutex that is taken by the architecture-specific code, and the
filesystem parts. The two interact via cpuhp, where the architecture code
updates the domain list. Filesystem handlers that walk the domains list should
not run concurrently with the cpuhp callback modifying the list.

Exposing a lock from the filesystem code means the interface is not cleanly
defined, and creates the possibility of cross-architecture lock ordering
headaches. The interaction only exists so that certain filesystem paths are
serialised against CPU hotplug. The CPU hotplug code already has a mechanism to
do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible to walk
the domains list in irq context. RCU is ideal for this, but some paths need to
be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part of a cpuhp
callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

Most of the filesystem code's domain list walkers are currently protected by
the rdtgroup_mutex taken in rdtgroup_kn_lock_live().  The exceptions are
rdt_bit_usage_show() and the mon_config helpers which take the lock directly.

Make the domain list protected by RCU. An architecture-specific lock prevents
concurrent writers. rdt_bit_usage_show() could walk the domain list using RCU,
but to keep all the filesystem operations the same, this is changed to call
cpus_read_lock().  The mon_config helpers send multiple IPIs; take the
cpus_read_lock() in these cases.

The other filesystem list walkers need to be able to sleep.  Add
cpus_read_lock() to rdtgroup_kn_lock_live() so that the cpuhp callbacks can't
be invoked when file system operations are occurring.

Add lockdep_assert_cpus_held() in the cases where the rdtgroup_kn_lock_live()
call isn't obvious.

Resctrl's domain online/offline calls now need to take the rdtgroup_mutex
themselves.
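
The resulting lock ordering for filesystem paths can be sketched as
follows (simplified; the real rdtgroup_kn_lock_live() also pins the
rdtgroup):

  static void resctrl_fs_lock(void)
  {
          /* cpus_read_lock() must always be taken first, because cpuhp
           * callbacks take rdtgroup_mutex with the hotplug lock held.
           */
          cpus_read_lock();
          mutex_lock(&rdtgroup_mutex);
  }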

  [ bp: Fold in a build fix: https://lore.kernel.org/r/87zfvwieli.ffs@tglx ]

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-25-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-19 19:28:07 +01:00
Linus Torvalds
6c160f16be Kbuild fixes for v6.8 (2nd)
- Reformat nested if-conditionals in Makefiles with 4 spaces
 
  - Fix CONFIG_DEBUG_INFO_BTF builds for big endian
 
  - Fix modpost for module srcversion
 
  - Fix an escape sequence warning in gen_compile_commands.py
 
  - Fix kallsyms to ignore ARMv4 thunk symbols
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmXSPDIVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRbEP/3oiRjevkrWG32cVy8ozNLZFZ87u
 tGDs3NNnV0XyQ5ymkRPVmSoahndatcg4/zI1PQ5/l0ryhqvF4egSHMZZ1zwGwtOz
 pj+VhT4525U+jjlYTX760VLBeOkzGB7Rmpr3zihy5Amg0TTiqDU0OKWDrKZrMLEw
 O9HGDJ0GlmEtVCcQ0yZg4bzfsRmgykZzGbc0p2OijUE321q5Svzezr0RpW3nXQwL
 MlsHLtFEas35wzK4JN2s8MDQ4x4bqG8wI4fikXA/gioMA+PMFKZNqcw/BuUey+Qz
 r8HwSFkftqbOtjWzn6FtisLzUfdcT/ycDZnWTGb4qbHt19YETXVpg0fKVZktnSzv
 h/0vvgwBP1r5h4J9N0GGURRV0Cx+LM94uNVgdy9neRtk3f4E0MbGtSe7xZ+7iRUj
 UZ676ul6QYfpaxAS8+/6pilQ7AKQ1Z2qoNPZG5aN44x0YR2qQk7aFc+RH5d1FnMU
 ZYh+0Se9JGlvobWBQiQw9NZ/3GUCBgC/HhHGqrrRnzU9lJCfRsG4kGhrKmgiUgJb
 z2EMZPDKDW58zQ+A9khBZSvqFwVL43oQTyXiFdaWMCFAVAY7pOC2h0e1kBn2Mth4
 qVIO9w5muet7u9ouoEfz7ZfXpDYCBOYwhGvkVG//0Ac71bKq1ZBYvl04P7QuMjxf
 YGihyF43epnMyECK
 =hE/P
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Reformat nested if-conditionals in Makefiles with 4 spaces

 - Fix CONFIG_DEBUG_INFO_BTF builds for big endian

 - Fix modpost for module srcversion

 - Fix an escape sequence warning in gen_compile_commands.py

 - Fix kallsyms to ignore ARMv4 thunk symbols

* tag 'kbuild-fixes-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kallsyms: ignore ARMv4 thunks along with others
  modpost: trim leading spaces when processing source files list
  gen_compile_commands: fix invalid escape sequence warning
  kbuild: Fix changing ELF file type for output of gen_btf for big endian
  docs: kconfig: Fix grammar and formatting
  kbuild: use 4-space indentation when followed by conditionals
2024-02-18 10:09:25 -08:00
Linus Torvalds
ddac3d8b8a - Use a GB page for identity mapping only when memory of this size is
requested so that mapping of reserved regions is prevented which would
   otherwise lead to system crashes on UV machines
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXR304ACgkQEsHwGGHe
 VUq+lhAAugdnBJMBOX1MYAZELYt4hHhUZx2VHIoGzaKjEeNpgz6WZ5WWfBDMtFyh
 dO0ijZlIen/aXflNnZcHxgTdEE1rsSc0+7u7I5/RNJFRnI2aawhOFcy8aUHlk8mB
 5lwa5bFTdUEX5LS8yd38ZnrLVq6NBzHZ0CaCmahBOnqpN5HxgDutB65H2DJex2TW
 JEFTVcNEBKrLVaZZzDMhv0DalvnvMXUWxAyQwqmi+n4jTADvpzyJGFYIXQ6DJgSW
 MOd00NOC0haX6Mg78wRjTdcgxq9DVfLxrk8zE/uj99w5pm/vpxTeD/Lg5dElR99i
 1waTGUoWUMCWOKcPfjoZRCvYhgbfCPMivdcKb2yB/aKdTwFjFevAb2tYeXTd8nSm
 lRFRhdx5JrPIFzvETBnE3h/CCY5NL7T3UO/fOaJXZum1pHyJCUWMNbQWanbhT4Oz
 cRPKafRSxpfL1v33q9TXIfweCbX7XgzVytOBZ6HzinjmgzFNYD57GtbrI3zjW6qG
 nO3AgPFzb+ly7pQLEqpAxvJTDO52scAyyJH4WCIIMPaIlMZKTAWc8G3kUWqQIBmj
 88j/cMdp6rkLNqsxcbbcQVMjwU8j6Kz0Kw1nkFT969X9OVFXKRQAhIpdCsFMBYXY
 jjUojzbNW5bc6o96LQ5ZcGaZiO2Vn9dvHJScuHWz5Elpe3QH8oA=
 =B6od
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.8_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:

 - Use a GB page for identity mapping only when memory of this size is
   requested so that mapping of reserved regions is prevented which
   would otherwise lead to system crashes on UV machines

* tag 'x86_urgent_for_v6.8_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
2024-02-18 09:22:48 -08:00
Alison Schofield
b626070ffc x86/numa: Fix the sort compare func used in numa_fill_memblks()
The compare function used to sort memblks into starting address
order fails when the result of its u64 address subtraction gets
truncated to an int upon return.

The impact of the bad sort is that memblks will be filled out
incorrectly. Depending on the set of memblks, a user may see no
errors at all but still have a bad fill, or see messages reporting
a node overlap that leads to numa init failure:

[] node 0 [mem: ] overlaps with node 1 [mem: ]
[] No NUMA configuration found

Replace with a comparison that can only result in: 1, 0, -1.
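
A sketch of the broken and fixed comparators (field names assumed from
struct numa_memblk):

  /* Broken: the u64 difference is truncated to int on return. */
  static int cmp_memblk_broken(const void *a, const void *b)
  {
          const struct numa_memblk *ma = a, *mb = b;

          return ma->start - mb->start;
  }

  /* Fixed: only ever returns 1, 0 or -1. */
  static int cmp_memblk(const void *a, const void *b)
  {
          const struct numa_memblk *ma = a, *mb = b;

          if (ma->start != mb->start)
                  return ma->start < mb->start ? -1 : 1;
          return 0;
  }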

Fixes: 8f012db27c ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/99dcb3ae87e04995e9f293f6158dc8fa0749a487.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-16 23:20:34 -08:00
Alison Schofield
9b99c17f75 x86/numa: Fix the address overlap check in numa_fill_memblks()
numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a
physical address range. To do so, it first creates a list of existing
memblks that overlap that address range. The issue is that it is off
by one when comparing to the end of the address range, so memblks
that do not overlap are selected.

The impact of selecting a memblk that does not actually overlap is
that an existing memblk may be filled when the expected action is to
do nothing and return NUMA_NO_MEMBLK to the caller. The caller can
then add a new NUMA node and memblk.

Replace the broken open-coded search for address overlap with the
memblock helper memblock_addrs_overlap(). Update the kernel doc
and in code comments.
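
The helper-based check could look like this sketch, where 'bi' is an
existing memblk, [start, end) is the range being filled, and the blk/count
bookkeeping is assumed from the surrounding function:

  /* Select only memblks that genuinely overlap [start, end): */
  if (memblock_addrs_overlap(bi->start, bi->end - bi->start,
                             start, end - start))
          blk[count++] = bi;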

Suggested-by: "Huang, Ying" <ying.huang@intel.com>

Fixes: 8f012db27c ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-16 23:20:34 -08:00
Sean Christopherson
910c57dfa4 KVM: x86: Mark target gfn of emulated atomic instruction as dirty
When emulating an atomic access on behalf of the guest, mark the target
gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault.  This
fixes a bug where KVM effectively corrupts guest memory during live
migration by writing to guest memory without informing userspace that the
page is dirty.

Marking the page dirty got unintentionally dropped when KVM's emulated
CMPXCHG was converted to do a user access.  Before that, KVM explicitly
mapped the guest page into kernel memory, and marked the page dirty during
the unmap phase.

Mark the page dirty even if the CMPXCHG fails, as the old data is written
back on failure, i.e. the page is still written.  The value written is
guaranteed to be the same because the operation is atomic, but KVM's ABI
is that all writes are dirty logged regardless of the value written.  And
more importantly, that's what KVM did before the buggy commit.
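
In code terms the fix amounts to marking the gfn dirty on any non-faulting
attempt; a sketch around the user-access CMPXCHG ('r' being its result):

  if (r < 0)
          return X86EMUL_UNHANDLEABLE;  /* the access itself faulted */

  /* Mark the page dirty whether or not the compare succeeded: on
   * failure the old value is written back, i.e. the page was written.
   */
  kvm_vcpu_mark_page_dirty(vcpu, gpa_to_gfn(gpa));

  if (r)
          return X86EMUL_CMPXCHG_FAILED;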

Huge kudos to the folks on the Cc list (and many others), who did all the
actual work of triaging and debugging.

Fixes: 1c2361f667 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Cc: Pasha Tatashin <tatashin@google.com>
Cc: Michael Krebs <mkrebs@google.com>
base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240215010004.1456078-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-16 16:56:01 -08:00
Linus Torvalds
683b783c20 ARM:
* Avoid dropping the page refcount twice when freeing an unlinked
   page-table subtree.
 
 * Don't source the VFIO Kconfig twice
 
 * Fix protected-mode locking order between kvm and vcpus
 
 RISC-V:
 
 * Fix steal-time related sparse warnings
 
 x86:
 
 * Cleanup gtod_is_based_on_tsc() to return "bool" instead of an "int"
 
 * Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
   if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
   the UNINITIALIZED state, the spurious request will result in KVM exiting to
   userspace, which in turn causes QEMU to constantly acquire and release
   QEMU's global mutex, to the point where the BSP is unable to make forward
   progress.
 
 * Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
   incorrectly truncated, and ultimately causes KVM to think a fixed counter
   has already been disabled (KVM thinks the old value is '0').
 
 * Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
   that is ultimately ignored due to ignore_msrs=true doesn't zero the output
   as intended.
 
 Selftests cleanups and fixes:
 
 * Remove redundant newlines from error messages.
 
 * Delete an unused variable in the AMX test (which causes build failures when
   compiling with -Werror).
 
 * Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
   error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
   and the test eventually got skipped).
 
 * Fix TSC related bugs in several Hyper-V selftests.
 
 * Fix a bug in the dirty ring logging test where a sem_post() could be left
   pending across multiple runs, resulting in incorrect synchronization between
   the main thread and the vCPU worker thread.
 
 * Relax the dirty log split test's assertions on 4KiB mappings to fix false
   positives due to the number of mappings for memslot 0 (used for code and
   data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmXPlokUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPs3AgApdYANmMEy2YaUZLYsQOEP388vLEf
 +CS9kChY6xWuYzdFPTpM4BqNVn46zPh+HDEHTCJy1eOLpeOg6HbaNGuF/1G98+HF
 COm7C2bWOrGAL/UMzPzciyEMQFE7c/h28Yuq/4XpyDNrFbnChYxPh9W4xexqoLhV
 QtGYU03guLCUsI5veY0rOrSJ5xEu9f8c63JH5JPahtbMB0uNoi0Kz7i86sbkkUg7
 OcTra+j/FyGVAWwEJ8Q2hcGlKn4DMeyQ/riUvPrfSarTqC6ZswKltg9EMSxNnojE
 LojijqRFjKklkXonnalVeDzJbG0OWHks8VO6JmCJdt0zwBRei0iLWi2LEg==
 =8/la
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Avoid dropping the page refcount twice when freeing an unlinked
     page-table subtree.

   - Don't source the VFIO Kconfig twice

   - Fix protected-mode locking order between kvm and vcpus

  RISC-V:

   - Fix steal-time related sparse warnings

  x86:

   - Cleanup gtod_is_based_on_tsc() to return "bool" instead of an "int"

   - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if
     and only if the incoming events->nmi.pending is non-zero. If the
     target vCPU is in the UNINITIALIZED state, the spurious request will
     result in KVM exiting to userspace, which in turn causes QEMU to
     constantly acquire and release QEMU's global mutex, to the point
     where the BSP is unable to make forward progress.

   - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl
     being incorrectly truncated, and ultimately causes KVM to think a
     fixed counter has already been disabled (KVM thinks the old value
     is '0').

   - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from
     userspace that is ultimately ignored due to ignore_msrs=true
     doesn't zero the output as intended.

  Selftests cleanups and fixes:

   - Remove redundant newlines from error messages.

   - Delete an unused variable in the AMX test (which causes build
     failures when compiling with -Werror).

   - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails
     with an error code other than ENOENT (a Hyper-V selftest bug
     resulted in an EMFILE, and the test eventually got skipped).

   - Fix TSC related bugs in several Hyper-V selftests.

   - Fix a bug in the dirty ring logging test where a sem_post() could
     be left pending across multiple runs, resulting in incorrect
     synchronization between the main thread and the vCPU worker thread.

   - Relax the dirty log split test's assertions on 4KiB mappings to fix
     false positives due to the number of mappings for memslot 0 (used
     for code and data that is NOT being dirty logged) changing, e.g.
     due to NUMA balancing"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: arm64: Fix double-free following kvm_pgtable_stage2_free_unlinked()
  RISC-V: KVM: Use correct restricted types
  RISC-V: paravirt: Use correct restricted types
  RISC-V: paravirt: steal_time should be static
  KVM: selftests: Don't assert on exact number of 4KiB in dirty log split test
  KVM: selftests: Fix a semaphore imbalance in the dirty ring logging test
  KVM: x86: Fix KVM_GET_MSRS stack info leak
  KVM: arm64: Do not source virt/lib/Kconfig twice
  KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
  KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
  KVM: selftests: Make hyperv_clock require TSC based system clocksource
  KVM: selftests: Run clocksource dependent tests with hyperv_clocksource_tsc_page too
  KVM: selftests: Use generic sys_clocksource_is_tsc() in vmx_nested_tsc_scaling_test
  KVM: selftests: Generalize check_clocksource() from kvm_clock_test
  KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
  KVM: arm64: Fix circular locking dependency
  KVM: selftests: Fail tests when open() fails with !ENOENT
  KVM: selftests: Avoid infinite loop in hyperv_features when invtsc is missing
  KVM: selftests: Delete superfluous, unused "stage" variable in AMX test
  KVM: selftests: x86_64: Remove redundant newlines
  ...
2024-02-16 10:48:14 -08:00
James Morse
eeff1d4f11 x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
When a CPU is taken offline the resctrl filesystem code needs to check if it
was the CPU nominated to perform the periodic overflow and limbo work. If so,
another CPU needs to be chosen to do this work.

This is currently done in core.c, mixed in with the code that removes the CPU
from the domain's mask, and potentially free()s the domain.

Move the migration of the overflow and limbo helpers into the filesystem code,
into resctrl_offline_cpu(). As resctrl_offline_cpu() runs before the
architecture code has removed the CPU from the domain mask, the callers need to
be told which CPU is being removed, to avoid picking it as the new CPU. This
uses the exclude_cpu feature previously added.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-24-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
258c91e84f x86/resctrl: Add CPU offline callback for resctrl work
The resctrl architecture-specific code may need to free a domain when a CPU
goes offline; it also needs to reset the CPU's PQR_ASSOC register.  Amongst
other things, the resctrl filesystem code needs to clear this CPU from the
cpu_mask of any control and monitor groups.

Currently, this is all done in core.c and called from resctrl_offline_cpu(),
making the split between architecture and filesystem code unclear.

Move the filesystem work to remove the CPU from the control and monitor groups
into a filesystem helper called resctrl_offline_cpu(), and rename the one in
core.c resctrl_arch_offline_cpu().
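
A rough sketch of the resulting split, with illustrative bodies (the arch
callback does the architecture work and hands the filesystem work to the
new helper):

  void resctrl_arch_offline_cpu(unsigned int cpu)
  {
          mutex_lock(&rdtgroup_mutex);
          resctrl_offline_cpu(cpu);       /* fs: clear cpu from group masks */
          domain_remove_cpu(cpu);         /* arch: update/free the domain */
          clear_closid_rmid(cpu);         /* arch: reset PQR_ASSOC */
          mutex_unlock(&rdtgroup_mutex);
  }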

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-23-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
978fcca954 x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
When a CPU is taken offline resctrl may need to move the overflow or limbo
handlers to run on a different CPU.

Once the offline callbacks have been split, cqm_setup_limbo_handler() will be
called while the CPU that is going offline is still present in the CPU mask.

Pass the CPU to exclude to cqm_setup_limbo_handler() and
mbm_setup_overflow_handler(). These functions can use a variant of
cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs need
excluding.
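
As a sketch, the CPU selection at the call sites then becomes (variable
names illustrative):

  if (exclude_cpu == -1)
          cpu = cpumask_any(&dom->cpu_mask);
  else
          cpu = cpumask_any_but(&dom->cpu_mask, exclude_cpu);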

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-22-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
1b3e50ce7f x86/resctrl: Add CPU online callback for resctrl work
The resctrl architecture-specific code may need to create a domain when a CPU
comes online; it also needs to reset the CPU's PQR_ASSOC register.  The resctrl
filesystem code needs to update the rdtgroup_default CPU mask when CPUs are
brought online.

Currently, this is all done in one function, resctrl_online_cpu().  It will
need to be split into architecture and filesystem parts before resctrl can be
moved to /fs/.

Pull the rdtgroup_default update work out as a filesystem specific cpu_online
helper. resctrl_online_cpu() is the obvious name for this, which means the
version in core.c needs renaming.

resctrl_online_cpu() is called by the arch code once it has done the work to
add the new CPU to any domains.

In future patches, resctrl_online_cpu() will take the rdtgroup_mutex itself.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-21-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
30017b6070 x86/resctrl: Add helpers for system wide mon/alloc capable
resctrl reads rdt_alloc_capable or rdt_mon_capable to determine whether any of
the resources support the corresponding features.  resctrl also uses the
static keys that affect the architecture's context-switch code to determine the
same thing.

This forces another architecture to have the same static keys.

As the static key is enabled based on the capable flag, and none of the
filesystem uses of these are in the scheduler path, move the capable flags
behind helpers, and use these in the filesystem code instead of the static key.

After this change, only the architecture code manages and uses the static keys
to ensure __resctrl_sched_in() does not need runtime checks.

This avoids multiple architectures having to define the same static keys.

Cases where the static key implicitly tested if the resctrl filesystem was
mounted all have an explicit check now.
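
On x86 such a helper is trivial; a sketch, assuming the capable flags keep
their current names:

  static inline bool resctrl_arch_alloc_capable(void)
  {
          return rdt_alloc_capable;
  }

  static inline bool resctrl_arch_mon_capable(void)
  {
          return rdt_mon_capable;
  }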

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-20-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
0a2f4d9b54 x86/resctrl: Make rdt_enable_key the arch's decision to switch
rdt_enable_key is switched when resctrl is mounted. It was also previously used
to prevent a second mount of the filesystem.

Any other architecture that wants to support resctrl has to provide identical
static keys.

Now that there are helpers for enabling and disabling the alloc/mon keys,
resctrl doesn't need to switch this extra key; that can be done by the arch code.
Use the static-key increment and decrement helpers, and change resctrl to
ensure the calls are balanced.
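
Sketched for the alloc key; the mon side is symmetrical (helper names
illustrative):

  void resctrl_arch_enable_alloc(void)
  {
          static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
          static_branch_inc_cpuslocked(&rdt_enable_key);
  }

  void resctrl_arch_disable_alloc(void)
  {
          static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
          static_branch_dec_cpuslocked(&rdt_enable_key);
  }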

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-19-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:33 +01:00
James Morse
5db6a4a75c x86/resctrl: Move alloc/mon static keys into helpers
resctrl enables three static keys depending on the features it has enabled.
Another architecture's context switch code may look different; any static keys
that control it should be buried behind helpers.

Move the alloc/mon logic into arch-specific helpers as a preparatory step for
making the rdt_enable_key's status something the arch code decides.

This means other architectures don't have to mirror the static keys.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-18-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
13e5769deb x86/resctrl: Make resctrl_mounted checks explicit
The rdt_enable_key is switched when resctrl is mounted, and used to prevent
a second mount of the filesystem. It also enables the architecture's context
switch code.

This requires another architecture to have the same set of static keys, as
resctrl depends on them too. The existing users of these static keys are
implicitly also checking if the filesystem is mounted.

Make the resctrl_mounted checks explicit: resctrl can keep track of whether it
has been mounted once. This doesn't need to be combined with whether the arch
code is context switching the CLOSID.

rdt_mon_enable_key is never used just to test that resctrl is mounted, but it
also has this implication. Add a resctrl_mounted check to all uses of
rdt_mon_enable_key.

This will allow the static key changing to be moved behind resctrl_arch_ calls.
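
Illustrative shape of the change at a call site (the function under the
check is a placeholder):

  /* Before: the static key implicitly also meant "mounted" */
  if (static_branch_unlikely(&rdt_mon_enable_key))
          setup_overflow_handler(dom);

  /* After: the mounted state is tested explicitly */
  if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
          setup_overflow_handler(dom);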

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-17-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
e557999f80 x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
Depending on the number of monitors available, Arm's MPAM may need to
allocate a monitor prior to reading the counter value. Allocating a
contended resource may involve sleeping.

__check_limbo() and mon_event_count() each make multiple calls to
resctrl_arch_rmid_read(). To avoid extra work on contended systems, the
allocation should remain valid across multiple invocations of
resctrl_arch_rmid_read().

The memory or hardware allocated is not specific to a domain.

Add arch hooks for this allocation, which need calling before
resctrl_arch_rmid_read(). The allocated monitor is passed to
resctrl_arch_rmid_read(), then freed again afterwards. The helper
can be called on any CPU, and can sleep.
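
The resulting usage pattern looks roughly like this (argument lists
abbreviated):

  void *arch_mon_ctx;

  arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
  if (IS_ERR(arch_mon_ctx))
          return;

  /* One allocation may back several counter reads */
  err = resctrl_arch_rmid_read(r, d, closid, rmid, QOS_L3_OCCUP_EVENT_ID,
                               &val, arch_mon_ctx);

  resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);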

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-16-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
6fde1424f2 x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
MPAM's cache occupancy counters can take a little while to settle once the
monitor has been configured. The maximum settling time is described to the
driver via a firmware table. The value could be large enough that it makes
sense to sleep. To avoid exposing this to resctrl, it should be hidden behind
MPAM's resctrl_arch_rmid_read().

resctrl_arch_rmid_read() may be called via IPI, meaning it is unable to sleep.
In this case, it should return an error if it needs to sleep. This will only
affect MPAM platforms where the cache occupancy counter isn't available
immediately, nohz_full is in use, and there are no housekeeping CPUs in the
necessary domain.

There are three callers of resctrl_arch_rmid_read(): __mon_event_count() and
__check_limbo() are both called from a non-migratable context.
mon_event_read() invokes __mon_event_count() using smp_call_on_cpu(), which
adds work to the target CPU's workqueue. rdtgroup_mutex is held, meaning this
cannot race with the resctrl cpuhp callback. __check_limbo() is invoked via
schedule_delayed_work_on(), which also adds work to a per-cpu workqueue.

The remaining call is add_rmid_to_limbo() which is called in response to
a user-space syscall that frees an RMID. This opportunistically reads the LLC
occupancy counter on the current domain to see if the RMID is over the dirty
threshold. This has to disable preemption to avoid reading the wrong domain's
value. Disabling preemption here prevents resctrl_arch_rmid_read() from
sleeping.

add_rmid_to_limbo() walks each domain, but only reads the counter on one
domain. If the system has more than one domain, the RMID will always be added
to the limbo list. If the RMID's usage was not over the threshold, it will be
removed from the list when __check_limbo() runs.  Make this the default
behaviour. Free RMIDs are always added to the limbo list for each domain.

The user visible effect of this is that a clean RMID is not available for
re-allocation immediately after 'rmdir()' completes. This behaviour was never
portable as it never happened on a machine with multiple domains.

Removing this path allows resctrl_arch_rmid_read() to sleep if it's called with
interrupts unmasked. Document that this is the expected behaviour, and add
a might_sleep() annotation to catch changes that won't work on arm64.
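
The annotation amounts to a guard of this shape (exact placement
illustrative):

  /* The counter read may sleep, but only when interrupts are unmasked */
  if (!irqs_disabled())
          might_sleep();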

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-15-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
09909e0981 x86/resctrl: Queue mon_event_read() instead of sending an IPI
Intel is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC;
the number implemented is up to the manufacturer. This means that when there
are fewer monitors than needed, they need to be allocated and freed.

MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
CSU counter is allowed to return 'not ready' for a small number of
micro-seconds after programming. To allow one CSU hardware monitor to be
used for multiple control or monitor groups, the CPU accessing the
monitor needs to be able to block when configuring and reading the
counter.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAM's monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible. On systems using nohz-full, some CPUs need to be
interrupted to run kernel work as they otherwise stay in user-space
running realtime workloads. Interrupting these CPUs should be avoided,
and scheduling work on them may never complete.

Change mon_event_read() to pick a housekeeping CPU (one that is not using
nohz_full), schedule mon_event_count() there, and wait. If all the CPUs
in a domain are using nohz_full, then an IPI is used as the fallback.

This function is only used in response to a user-space filesystem request
(not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c. When the IPI fallback is used on
machines where MPAM needs to make an access on multiple CPUs, the counter
read will always fail.
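
A condensed sketch of the new flow in mon_event_read() (error handling
omitted):

  cpu = cpumask_any_housekeeping(&d->cpu_mask);

  if (tick_nohz_full_cpu(cpu))
          /* No housekeeping CPU in the domain: fall back to an IPI */
          smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
  else
          smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);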

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-14-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
a4846aaf39 x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
The limbo and overflow code picks a CPU to use from the domain's list of online
CPUs. Work is then scheduled on these CPUs to maintain the limbo list and any
counters that may overflow.

cpumask_any() may pick a CPU that is marked nohz_full, which will either
penalise the work that CPU was dedicated to, or delay the processing of limbo
list or counters that may overflow. Perhaps indefinitely. Delaying the overflow
handling will skew the bandwidth values calculated by mba_sc, which expects to
be called once a second.

Add cpumask_any_housekeeping() as a replacement for cpumask_any() that prefers
housekeeping CPUs. This helper will still return a nohz_full CPU if that is the
only option. The CPU to use is re-evaluated each time the limbo/overflow work
runs. This ensures the work will move off a nohz_full CPU once a housekeeping
CPU is available.
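
The helper ends up along these lines:

  static inline unsigned int
  cpumask_any_housekeeping(const struct cpumask *mask)
  {
          unsigned int cpu, hk_cpu;

          /* Any CPU will do, as long as it isn't nohz_full... */
          cpu = cpumask_any(mask);
          if (!tick_nohz_full_cpu(cpu))
                  return cpu;

          hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
          if (hk_cpu < nr_cpu_ids)
                  cpu = hk_cpu;

          /* ...but a nohz_full CPU is still better than nothing */
          return cpu;
  }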

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-13-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
6eca639d83 x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
When switching tasks, the CLOSID and RMID that the new task should use
are stored in struct task_struct. For x86 the CLOSID known by resctrl,
the value in task_struct, and the value written to the CPU register are
all the same thing.

MPAM's CPU interface has two different PARTIDs: one for data accesses,
the other for instruction fetch. Storing resctrl's CLOSID value in
struct task_struct implies the arch code knows whether resctrl is using
CDP.

Move the matching and setting of the struct task_struct properties to
use helpers. This allows arm64 to store the hardware format of the
register, instead of having to convert it each time.

__rdtgroup_move_task()'s use of READ_ONCE()/WRITE_ONCE() ensures torn
values aren't seen, as another CPU may schedule the task being moved
while the value is being changed. MPAM has an additional corner-case
here as the PMG bits extend the PARTID space.

If the scheduler sees a new-CLOSID but old-RMID, the task will dirty an
RMID that the limbo code is not watching, causing an inaccurate count.

x86's RMID are independent values, so the limbo code will still be
watching the old-RMID in this circumstance.

To avoid this, arm64 needs the CLOSID and RMID to be WRITE_ONCE()d together;
both values must always be provided together.

Because MPAM's RMID values are not unique, the CLOSID must be provided
when matching the RMID.
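
On x86 the helpers are trivial; sketched with the current task_struct
field names:

  static inline bool resctrl_arch_match_rmid(struct task_struct *tsk,
                                             u32 ignored_closid, u32 rmid)
  {
          /* x86 RMIDs are unique, so the CLOSID can be ignored here */
          return READ_ONCE(tsk->rmid) == rmid;
  }

  static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
                                                  u32 closid, u32 rmid)
  {
          WRITE_ONCE(tsk->closid, closid);
          WRITE_ONCE(tsk->rmid, rmid);
  }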

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-12-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
6eac36bb9e x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be used
for different control groups.

This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.

Instead of allocating the first free CLOSID, on architectures where
CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is enabled, search
closid_num_dirty_rmid[] to find the cleanest CLOSID.

The CLOSID found is returned to closid_alloc() for the free list
to be updated.
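
A sketch of the search (bookkeeping details omitted):

  static int resctrl_find_cleanest_closid(void)
  {
          int cleanest_closid = -1;
          u32 i;

          for_each_set_bit(i, &closid_free_map, closid_free_map_len) {
                  int num_dirty = closid_num_dirty_rmid[i];

                  /* A CLOSID with no dirty RMID cannot be beaten */
                  if (num_dirty == 0)
                          return i;

                  if (cleanest_closid == -1 ||
                      num_dirty < closid_num_dirty_rmid[cleanest_closid])
                          cleanest_closid = i;
          }

          return cleanest_closid < 0 ? -ENOSPC : cleanest_closid;
  }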

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-11-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:32 +01:00
James Morse
5d920b6881 x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
The resctrl CLOSID allocator uses a single 32-bit word to track which
CLOSIDs are free. The setting and clearing of bits is open coded.

Convert the existing open coded bit manipulations of closid_free_map
to use __set_bit() and friends. These don't need to be atomic as this
list is protected by the mutex.
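
I.e. the open-coded bit twiddling on closid_free_map becomes:

  __set_bit(closid, &closid_free_map);     /* mark the CLOSID free */
  __clear_bit(closid, &closid_free_map);   /* mark the CLOSID in use */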

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-10-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
b30a55df60 x86/resctrl: Track the number of dirty RMID a CLOSID has
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.

This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.

Keep track of the number of RMID held in limbo each CLOSID has. This will
allow a future helper to find the 'cleanest' CLOSID when allocating.

The array is only needed when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is
defined. This will never be the case on x86.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-9-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
c4c0376eef x86/resctrl: Allow RMID allocation to be scoped by CLOSID
MPAM's RMID values are not unique unless the CLOSID is considered as well.

alloc_rmid() expects the RMID to be an independent number.

Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
allocating. If the CLOSID is not relevant to the index, this ends up comparing
the free RMID with itself, and the first free entry will be used. With MPAM the
CLOSID is included in the index, so this becomes a walk of the free RMID
entries, until one that matches the supplied CLOSID is found.
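
Sketched, the allocator's walk looks like this (list handling trimmed):

  list_for_each_entry(entry, &rmid_free_lru, list) {
          itr_idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
          cmp_idx = resctrl_arch_rmid_idx_encode(closid, entry->rmid);

          /*
           * If the CLOSID is irrelevant to the index (x86), the two
           * values always match and the first free entry is taken.
           */
          if (itr_idx == cmp_idx)
                  return entry;
  }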

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-8-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
6791e0ea30 x86/resctrl: Access per-rmid structures by index
x86 systems identify traffic using the CLOSID and RMID. The CLOSID is
used to look up the control policy; the RMID is used for monitoring. For
x86 these are independent numbers.
Arm's MPAM has equivalent features PARTID and PMG, where the PARTID is
used to look up the control policy. The PMG in contrast is a small number
of bits that are used to subdivide PARTID when monitoring. The
cache-occupancy monitors require the PARTID to be specified when
monitoring.

This means MPAM's PMG field is not unique. There are multiple PMG-0s, one
per allocated CLOSID/PARTID. If PMG is treated as equivalent to RMID, it
cannot be allocated as an independent number. Bitmaps like rmid_busy_llc
need to be sized by the number of unique entries for this resource.

Treat the combined CLOSID and RMID as an index, and provide architecture
helpers to pack and unpack an index. This makes the MPAM values unique.
The domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as
are domain mbm_local[] and mbm_total[].

x86 can ignore the CLOSID field when packing and unpacking an index, and
report as many indexes as RMID.
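
The x86 pack/unpack helpers therefore collapse to (sketch):

  static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored_closid, u32 rmid)
  {
          return rmid;
  }

  static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
  {
          *closid = X86_RESCTRL_EMPTY_CLOSID;
          *rmid = idx;
  }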

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-7-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
40fc735b78 x86/resctrl: Track the closid with the rmid
x86's RMIDs are independent of the CLOSID. An RMID can be allocated,
used and freed without considering the CLOSID.

MPAM's equivalent feature is PMG, which is not an independent number;
it extends the CLOSID/PARTID space. For MPAM, only PMG-bits worth of
'RMID' can be allocated for a single CLOSID.
i.e. if there is 1 bit of PMG space, then each CLOSID can have two
monitor groups.

To allow resctrl to disambiguate RMID values for different CLOSID,
everything in resctrl that keeps an RMID value needs to know the CLOSID
too. This will always be ignored on x86.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-6-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
311639e951 x86/resctrl: Move RMID allocation out of mkdir_rdt_prepare()
RMIDs are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of RMID is not an independent number, so can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been
allocated.

Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
after the mkdir_rdt_prepare() call. This allows the RMID allocator to
know the CLOSID.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-5-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
b1de313979 x86/resctrl: Create helper for RMID allocation and mondata dir creation
When monitoring is supported, each monitor and control group is allocated an
RMID. For control groups, rdtgroup_mkdir_ctrl_mon() later goes on to allocate
the CLOSID.

MPAM's equivalent of RMID is not an independent number, so can't be allocated
until the CLOSID is known. An RMID allocation for one CLOSID may fail, whereas
another may succeed depending on how many monitor groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been allocated.

Move the RMID allocation and mondata dir creation to a helper.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-4-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
James Morse
3f7b07380d x86/resctrl: Free rmid_ptrs from resctrl_exit()
rmid_ptrs[] is allocated from dom_data_init() but never free()d.

While the exit text ends up in the linker script's DISCARD section,
the direction of travel is for resctrl to be/have loadable modules.

Add resctrl_put_mon_l3_config() to cleanup any memory allocated
by rdt_get_mon_l3_config().

There is no reason to backport this to a stable kernel.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-3-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-02-16 19:18:31 +01:00
Thomas Gleixner
89b0f15f40 x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
Now that __num_cores_per_package and __num_threads_per_package are
available, cpuinfo::x86_max_cores and the related math all over the place
can be replaced with the ready-to-consume data.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.176147806@linutronix.de
2024-02-16 15:51:32 +01:00
Borislav Petkov (AMD)
03ceaf678d x86/CPU/AMD: Do the common init on future Zens too
There's no need to enable the common Zen init stuff for each new family:
just do it by default on every family >= 0x17.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240201161024.30839-1-bp@alien8.de
2024-02-16 13:15:12 +01:00
Hou Tao
32019c659e x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
When trying to use copy_from_kernel_nofault() to read vsyscall page
through a bpf program, the following oops was reported:

  BUG: unable to handle page fault for address: ffffffffff600000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
  ......
  Call Trace:
   <TASK>
   ? copy_from_kernel_nofault+0x6f/0x110
   bpf_probe_read_kernel+0x1d/0x50
   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
   trace_call_bpf+0xc5/0x1c0
   perf_call_bpf_enter.isra.0+0x69/0xb0
   perf_syscall_enter+0x13e/0x200
   syscall_trace_enter+0x188/0x1c0
   do_syscall_64+0xb5/0xe0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
  ......
  ---[ end trace 0000000000000000 ]---

The oops is triggered when:

1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
page and invokes copy_from_kernel_nofault() which in turn calls
__get_user_asm().

2) Because the vsyscall page address is not readable from kernel space,
a page fault exception is triggered accordingly.

3) handle_page_fault() considers the vsyscall page address as a user
space address instead of a kernel space address. This results in the
fix-up setup by bpf not being applied and a page_fault_oops() is invoked
due to SMAP.

Considering handle_page_fault() has already considered the vsyscall page
address as a userspace address, fix the problem by disallowing vsyscall
page read for copy_from_kernel_nofault().
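
The added check in copy_from_kernel_nofault_allowed() is roughly:

  /*
   * The vsyscall page is treated as a user address by the fault
   * handler, so it has to be rejected here as well.
   */
  if (is_vsyscall_vaddr(vaddr))
          return false;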

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 19:21:39 -08:00
Hou Tao
ee0e39a63b x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
Move is_vsyscall_vaddr() into asm/vsyscall.h to make it available for
copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
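
For reference, the helper itself is a one-liner:

  static inline bool is_vsyscall_vaddr(unsigned long vaddr)
  {
          return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
  }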

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240202103935.3154011-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 19:21:39 -08:00
Jakub Kicinski
73be9a3aab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/dev.c
  9f30831390 ("net: add rcu safety to rtnl_prop_list_size()")
  723de3ebef ("net: free altname using an RCU callback")

net/unix/garbage.c
  11498715f2 ("af_unix: Remove io_uring code for GC.")
  25236c91b5 ("af_unix: Fix task hung while purging oob_skb in GC.")

drivers/net/ethernet/renesas/ravb_main.c
  ed4adc0720 ("net: ravb: Count packets instead of descriptors in GbEth RX path")
  c2da940857 ("ravb: Add Rx checksum offload support for GbEth")

net/mptcp/protocol.c
  bdd70eb689 ("mptcp: drop the push_pending field")
  28e5c13805 ("mptcp: annotate lockless accesses around read-mostly fields")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15 16:20:04 -08:00
Thomas Gleixner
fd43b8ae76 x86/cpu/topology: Provide __num_[cores|threads]_per_package
Expose properly accounted information and accessors so the fiddling with
other topology variables can be replaced.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.120958987@linutronix.de
2024-02-15 22:07:45 +01:00
Thomas Gleixner
bd745d1c41 x86/cpu/topology: Rename topology_max_die_per_package()
The plural of die is dies.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.065874205@linutronix.de
2024-02-15 22:07:45 +01:00
Thomas Gleixner
8078f4d610 x86/cpu/topology: Rename smp_num_siblings
It's really a non-intuitive name. Rename it to __max_threads_per_core which
is obvious.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210253.011307973@linutronix.de
2024-02-15 22:07:45 +01:00
Thomas Gleixner
3205c9833d x86/cpu/topology: Retrieve cores per package from topology bitmaps
Similar to other sizing information, the number of cores per package can be
established from the topology bitmap.

Provide a function for retrieving that information and replace the buggy
hack in the CPUID evaluation with it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.956858282@linutronix.de
2024-02-15 22:07:45 +01:00
Thomas Gleixner
380414be78 x86/cpu/topology: Use topology logical mapping mechanism
Replace the logical package and die management functionality and retrieve
the logical IDs from the topology bitmaps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.901865302@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
b7065f4f84 x86/cpu/topology: Provide logical pkg/die mapping
With the topology bitmaps in place the logical package and die IDs can
trivially be retrieved by determining the bitmap weight of the relevant
topology domain level up to and including the physical ID in question.

Provide a function to that effect.
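
Conceptually (internal map names illustrative):

  int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level)
  {
          /* Strip the bits below @at_level from the APIC ID */
          u32 lvlid = topo_apicid(apicid, at_level);

          if (!test_bit(lvlid, apic_maps[at_level].map))
                  return -ENODEV;

          /* The number of set bits below @lvlid is the logical ID */
          return bitmap_weight(apic_maps[at_level].map, lvlid);
  }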

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.846136196@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
5e40fb2d4a x86/cpu/topology: Simplify cpu_mark_primary_thread()
No point in creating a mask via fls(). smp_num_siblings is guaranteed to be
a power of 2. So just using (smp_num_siblings - 1) has the same effect.
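
I.e. the test simplifies to roughly:

  static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
  {
          /* The primary thread has the SMT bits of its APIC ID clear */
          if (!(apicid & (smp_num_siblings - 1)))
                  cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
  }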

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.791176581@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
882e0cff9e x86/cpu/topology: Mop up primary thread mask handling
The early initcall to initialize the primary thread mask is no longer
required because topology_init_possible_cpus() can mark primary threads
correctly when initializing the possible and present map as the number of
SMT threads is already determined correctly.

The XENPV workaround is no longer required because XENPV now registers
fake APIC IDs which will just work like any other enumeration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.736104257@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
090610ba70 x86/cpu/topology: Use topology bitmaps for sizing
Now that all possible APIC IDs are tracked in the topology bitmaps, it's
trivial to retrieve the real information from there.

This gets rid of the guesstimates for the maximal packages and dies per
package as the actual numbers can be determined before a single AP has been
brought up.

The number of SMT threads can now be determined correctly from the bitmaps
in all situations. Up to now a system which has SMT disabled in the BIOS
will still claim that it is SMT capable, because the lowest APIC ID bit is
reserved for that and CPUID leaf 0xb/0x1f still enumerates the SMT domain
accordingly. By calculating the bitmap weights of the SMT and the CORE
domains and setting them in relation, the SMT-disabled-in-BIOS case
correctly reports that the system is not SMT capable.

It also handles the situation correctly when a hybrid system's boot CPU does
not have SMT, as it takes the SMT capability of the APs fully into account.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.681709880@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
354da4cf57 x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
It turns out that XEN/PV Dom0 has halfway-usable CPUID/MADT enumeration
except that it cannot deal with CPUs which are enumerated as disabled in
MADT.

DomU has no MADT and provides at least rudimentary topology information in
CPUID leaves 1 and 4.

For both it's important that there are not more possible Linux CPUs than
vCPUs provided by the hypervisor.

As this is ensured by counting the vCPUs before enumeration happens:

  - lift the restrictions in the CPUID evaluation and the MADT parser

  - Utilize MADT registration for Dom0

  - Keep the fake APIC ID registration for DomU

  - Fix the XEN APIC fake so the readout of the local APIC ID works for
    Dom0 via the hypercall and for DomU by returning the registered
    fake APIC IDs.

With that the XEN/PV fake approximates usefulness.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.626195405@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
c8f808231f x86/xen/smp_pv: Count number of vCPUs early
XEN/PV has a completely broken vCPU enumeration scheme, which just works by
chance and provides zero topology information. Each vCPU ends up being a
single core package.

Dom0 provides MADT which can be used for topology information, but that
table is the unmodified host table, which means that there can be more CPUs
registered than the number of vCPUs XEN provides for the dom0 guest.

DomU does not have ACPI, and both rely on counting the possible vCPUs via a
hypercall.

To prepare for using CPUID topology information either via MADT or via fake
APIC IDs count the number of possible CPUs during early boot and adjust
nr_cpu_ids() accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.571795063@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
ea2dd8a5d4 x86/cpu/topology: Assign hotpluggable CPUIDs during init
There is no point in assigning the CPU numbers during ACPI physical
hotplug. The number of possible hotplug CPUs is known when the possible map
is initialized, so the CPU numbers can be associated to the registered
non-present APIC IDs right there.

This allows more code to be put into the __init section and makes the
related data __ro_after_init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.517339971@linutronix.de
2024-02-15 22:07:44 +01:00
Thomas Gleixner
7cdcdab1a6 x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
The topology bitmaps track all possible APIC IDs which have been registered
during enumeration. As sizing and further topology information is going to
be derived from these bitmaps, reject attempts to hotplug an APIC ID which
was not registered during enumeration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.462231229@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
f1f758a805 x86/topology: Add a mechanism to track topology via APIC IDs
Topology on X86 is determined by the registered APIC IDs and the
segmentation information retrieved from CPUID. Depending on the granularity
of the provided CPUID information, the most fine-grained scheme looks like
this according to Intel terminology:

   [PKG][DIEGRP][DIE][TILE][MODULE][CORE][THREAD]

Non-enumerated domain levels consume 0 bits in the APIC ID. This allows a
consistent view of the topology and makes it possible to determine other
information precisely, like the number of cores in a package on hybrid
systems, where the existing assumption that number of cores == number of
threads / threads per core does not hold.

Provide per domain level bitmaps which record the APIC ID split into the
domain levels to make later evaluation of domain level specific information
simple. This makes it possible to calculate e.g. the logical IDs without any
extra logic.

Contrary to the existing registration mechanism this records disabled CPUs,
which are subject to later hotplug as well. That's useful for boot time
sizing of package or die dependent allocations without using heuristics.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.406985021@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
5c5682b9f8 x86/cpu: Detect real BSP on crash kernels
When a kdump kernel is started from a crashing CPU then there is no
guarantee that this CPU is the real boot CPU (BSP). If the kdump kernel
tries to online the BSP then the INIT sequence will reset the machine.

There is a command line option to prevent this, but in case of nested kdump
kernels this is wrong.

But that command line option is not required at all because the real
BSP is enumerated as the first CPU by firmware. Support for the only
known system which was different (Voyager) got removed long ago.

Detect whether the boot CPU APIC ID is the first APIC ID enumerated by
the firmware. If the first enumerated APIC ID does not match the boot
CPU's APIC ID, then skip registering it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.348542071@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
7c0edad364 x86/cpu/topology: Rework possible CPU management
Managing possible CPUs is an unreadable and incomprehensible maze. Aside from
that, it's backwards because it applies command line limits after
registering all APICs.

Rewrite it so that it:

  - Applies the command line limits upfront so that only the allowed amount
    of APIC IDs can be registered.

  - Applies eventual late restrictions in an understandable way

  - Uses simple min_t() calculations which are trivial to follow.

  - Provides a separate function for resetting to UP mode late in the
    bringup process.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.290098853@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
0e53e7b656 x86/cpu/topology: Sanitize the APIC admission logic
Move the actually required content of generic_processor_info() into the call
sites and use common helper functions for them. This separates the early
boot registration and the ACPI hotplug mechanism completely which allows
further cleanups and improvements.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.230433953@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
6055f6cf0d x86/smpboot: Make error message actually useful
"smpboot: native_kick_ap: bad cpu 33" is absolutely useless information.

Replace it with something meaningful which allows the failure condition to
be decoded.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.170806023@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
72530464ed x86/cpu/topology: Use a data structure for topology info
Put the processor accounting into a data structure, which will gain more
topology related information in the next steps, and sanitize the accounting.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.111451909@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
4c4c6f3870 x86/cpu/topology: Simplify APIC registration
Checking twice in the same code path whether the number of assigned CPUs
has reached the nr_cpu_ids limit is pointless. Repeating the
information that CPUs are ignored over and over is also pointless noise.

Remove the redundant check and reduce the noise by using a pr_warn_once().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210252.050264369@linutronix.de
2024-02-15 22:07:43 +01:00
Thomas Gleixner
58aa34abe9 x86/cpu/topology: Confine topology information
Now that all external fiddling with num_processors and disabled_cpus is
gone, move the last user prefill_possible_map() into the topology code too
and remove the global visibility of these variables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.994756960@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
e753070234 x86/xen/smp_pv: Register fake APICs
XENPV does not use the APIC. It just piggybacks on the infrastructure
and fiddles with global variables as it sees fit.

These global variables are going away, so let XENPV register pseudo APIC
IDs to keep the accounting correct and keep up the illusion that XEN/PV is
something sane.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.940043512@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
cab8e164a4 x86/acpi: Dont invoke topology_register_apic() for XEN PV
The MADT table for XEN/PV dom0 is not really useful and registering the
APICs is currently a pointless exercise because XENPV does not use an
APIC at all.

It overrides the x86_init.mpparse.parse_smp_config() callback, resets
num_processors and counts how many of them are provided by the hypervisor.

This is in the way of cleaning up the APIC registration. Prevent MADT
registration for XEN/PV temporarily until the rework is completed and
XEN/PV can use the MADT again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.885489468@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
8098428c54 x86/mpparse: Use new APIC registration function
Aside from switching over to the new interface, record the number of
registered CPUs locally, which allows num_processors and disabled_cpus to be
confined to the topology code.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.830955273@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
7d319c0fca x86/of: Use new APIC registration functions
No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.776009244@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
8cd01c8a68 x86/jailhouse: Use new APIC registration function
No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.720970412@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
ff37b09c84 x86/acpi: Use new APIC registration functions
Use the new topology registration functions and make the early boot code
path __init. No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.664738831@linutronix.de
2024-02-15 22:07:42 +01:00
Thomas Gleixner
4176b541c2 x86/cpu/topology: Provide separate APIC registration functions
generic_processor_info(), aside from being a complete misnomer, is used for
both early boot registration and ACPI CPU hotplug.

While it's arguable that these can share some code, the result is code which
is hard to understand and kept around post-init for no real reason.

Also, the call sites do lots of manual fiddling with topology-related
variables instead of having proper interfaces for the purpose which handle
the topology internals correctly.

Provide topology_register_apic(), topology_hotplug_apic() and
topology_hotunplug_apic() which have the extra magic of the call sites
incorporated and for now are wrappers around generic_processor_info().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.605007456@linutronix.de
2024-02-15 22:07:42 +01:00
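A sketch of the wrapper shape described above, with simplified signatures
(the real interfaces may differ):

  /* Early boot registration: the call-site fiddling is folded in. */
  void topology_register_apic(u32 apic_id, u32 acpi_id, bool present)
  {
          generic_processor_info(apic_id);
  }

  /* ACPI CPU hotplug: the same core logic behind a dedicated entry. */
  int topology_hotplug_apic(u32 apic_id, u32 acpi_id)
  {
          return generic_processor_info(apic_id);
  }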
Thomas Gleixner
c0a66c2847 x86/cpu/topology: Move registration out of APIC code
The APIC/CPU registration sits in the middle of the APIC code. In fact this
is a topology evaluation function and has nothing to do with the inner
workings of the local APIC.

Move it out into a file which reflects what this is about.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.543948812@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
1a5d0f62d1 x86/apic: Use a proper define for invalid ACPI CPU ID
The ACPI ID for CPUs is preset with U32_MAX, which is completely
non-obvious. Use a proper define for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.177504138@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
4a5f72a4a3 x86/apic: Remove yet another dubious callback
Paranoia is not wrong, but an APIC callback which is a complete NOOP in
most implementations, and in the one remaining implementation merely checks
whether the APIC ID of an upcoming CPU has been registered, serves no
purpose. That is the same APIC ID which was used to bring the CPU out of
wait-for-startup.

That's paranoia for paranoia's sake. Remove the voodoo.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.116510935@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
58d1692835 x86/apic: Remove the pointless writeback of boot_cpu_physical_apicid
There is absolutely no point in writing the APIC ID, which was read from
the local APIC earlier, back into the local APIC for the 64-bit UP case.

Remove that along with the apic callback which is solely there for this
pointless exercise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.055288922@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
350b5e2730 x86/mpparse: Remove the physid_t bitmap wrapper
physid_t is a wrapper around a bitmap. Just remove the onion layer and use
the bitmap functionality directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.994904510@linutronix.de
2024-02-15 22:07:41 +01:00
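The direction of the change, sketched with the generic bitmap API (the
helper is illustrative; the actual conversion touches the existing mpparse
and APIC code):

  /* Instead of physid_mask_t plus its own helper zoo, use a plain
   * bitmap and the stock bitops. */
  static DECLARE_BITMAP(phys_cpu_present_map, MAX_LOCAL_APIC);

  static void mark_apic_present(unsigned int apicid)
  {
          if (!test_and_set_bit(apicid, phys_cpu_present_map))
                  pr_info("APIC ID %u now present\n", apicid);
  }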
Thomas Gleixner
de6aec2417 x86/mm/numa: Move early mptable evaluation into common code
There is no reason to have the early mptable evaluation conditionally
invoked only from the AMD numa topology code.

Make it explicit and invoke it from setup_arch() right after the
corresponding ACPI init call. Remove the pointless wrapper and invoke
x86_init::mpparse::early_parse_smp_config() directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.931761608@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
dcb7600849 x86/mpparse: Switch to new init callbacks
Now that all platforms have the new split SMP configuration callbacks set
up, flip the switch and remove the old callback pointer and mop up the
platform code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.870883080@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
c22e19cd2c x86/hyperv/vtl: Prepare for separate mpparse callbacks
Initialize the new callbacks in preparation for switching the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.808238769@linutronix.de
2024-02-15 22:07:41 +01:00
Thomas Gleixner
0baf4d485c x86/xen/smp_pv: Prepare for separate mpparse callbacks
Provide a wrapper around the existing function and fill the new callbacks
in.

No functional change as the new callbacks are not yet operational.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.745028043@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
30c928691c x86/jailhouse: Prepare for separate mpparse callbacks
Provide a wrapper around the existing function and fill the new callbacks
in.

No functional change as the new callbacks are not yet operational.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.683073662@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
a626ded4e3 x86/platform/intel-mid: Prepare for separate mpparse callbacks
Initialize the split SMP configuration callbacks with NOOPs as MID is
strictly ACPI only.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20240212154639.620189339@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
fe280ffd7e x86/platform/ce4100: Prepare for separate mpparse callbacks
Select x86_dtb_parse_smp_config() as the SMP configuration parser in
preparation for splitting up the get_smp_config() callback.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.558085053@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
5faf8ec771 x86/dtb: Rename x86_dtb_init()
x86_dtb_init() is a misnomer and it really should be used as an SMP
configuration parser which is selected by the platform via
x86_init::mpparse::parse_smp_config().

Rename it to x86_dtb_parse_smp_config() in preparation for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.495992801@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
d0a85126b1 x86/mpparse: Prepare for callback separation
In preparation for splitting the get_smp_config() callback, rename
default_get_smp_config() to mpparse_get_smp_config() and provide an early
and late wrapper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.433811243@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
fc60fd009c x86/mpparse: Provide separate early/late callbacks
The early argument of x86_init::mpparse::get_smp_config() is more than
confusing. Provide two callbacks, one for each purpose.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.370491894@linutronix.de
2024-02-15 22:07:40 +01:00
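The shape of the split, sketched (the struct names here are illustrative
stand-ins for the real x86_init::mpparse ops):

  /* Before: one callback with an easily misread flag argument. */
  struct mpparse_ops_old {
          void (*get_smp_config)(unsigned int early);
  };

  /* After: one callback per purpose, nothing to misread. */
  struct mpparse_ops_new {
          void (*early_parse_smp_config)(void);
          void (*parse_smp_config)(void);
  };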
Thomas Gleixner
e061c7ae08 x86/mpparse: Rename default_find_smp_config()
MPTABLE is no longer the default SMP configuration mechanism.  Rename it to
mpparse_find_mptable() because that's what it does.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.306287711@linutronix.de
2024-02-15 22:07:40 +01:00
Thomas Gleixner
3e48d804c8 x86/apic: Remove check_apicid_used() and ioapic_phys_id_map()
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.243307499@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
4b99e735a5 x86/ioapic: Simplify setup_ioapic_ids_from_mpc_nocheck()
No need to go through APIC callbacks. It's already established that this is
an ancient APIC. So just copy the present mask and use the direct physid*
functions all over the place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.181901887@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
533535afc0 x86/ioapic: Make io_apic_get_unique_id() simpler
No need to go through APIC callbacks. It's already established that this is
an ancient APIC. So just copy the present mask and use the direct physid*
functions all over the place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.119261725@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
517234446c x86/apic: Get rid of get_physical_broadcast()
There is no point to this function. The only case where it is used is
when no XAPIC is available, which means the broadcast address is 0xF.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.057209154@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
2ac9e529d7 x86/ioapic: Replace some more set bit nonsense
Yet another set_bit() operation open-coded by OR-ing a mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.995080989@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
490cc3c5e7 x86/platform/ce4100: Don't override x86_init.mpparse.setup_ioapic_ids
There is no point in doing that. The ATOMs have an XAPIC, for which this
function is a pointless exercise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.931617775@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
52128a7a21 x86/cpu/topology: Make the APIC mismatch warnings complete
Detect all possible combinations of mismatch right in the CPUID evaluation
code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.867699078@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
bcccdf8b30 x86/apic/uv: Remove the private leaf 0xb parser
The package shift has already been evaluated by the early CPU init.

Put the mindless copy right next to the original leaf 0xb parser.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.637385562@linutronix.de
2024-02-15 22:07:39 +01:00
Thomas Gleixner
d5474e4d2c x86/xen/smp_pv: Remove cpudata fiddling
The new topology CPUID parser already installs a fake topology for XEN/PV,
which ends up with cpuinfo::max_cores = 1.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.576579177@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
035fc90a9d x86/apic: Remove unused phys_pkg_id() callback
Now that the core code does not use this monstrosity anymore, it's time to
put it to rest.

The only real purpose was to read the APIC ID on UV and VSMP systems for
the actual evaluation. That's what the core code does now.

For doing the actual shift operation there is truly no APIC callback
required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.516536121@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
fab75e790f x86/cpu: Remove x86_coreid_bits
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.455839743@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
6cf70394e7 x86/cpu: Remove topology.c
No more users. Stick it into the ugly code museum.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.395230346@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
03fa6bea5a x86/cpu: Make topology_amd_node_id() use the actual node info
Now that everything is converted, switch it over and remove the
intermediate operation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.334185785@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
d805a69160 x86/mm/numa: Use core domain size on AMD
cpuinfo::topo::x86_coreid_bits is about to be phased out. Use the core
domain size from the topology information.

Add a comment explaining why the early MPTABLE parsing is required, and
decrapify the loop which sets up the APIC ID to node map.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.270320718@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
3279081dd0 x86/cpu: Use common topology code for HYGON
Switch it over to use the consolidated topology evaluation and remove the
temporary safeguards which are no longer needed.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.207750409@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
c749ce393b x86/cpu: Use common topology code for AMD
Switch it over to the new topology evaluation mechanism and remove the
random bits and pieces which are sprinkled all over the place.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.145745053@linutronix.de
2024-02-15 22:07:38 +01:00
Thomas Gleixner
ace278e7ec x86/smpboot: Teach it about topo.amd_node_id
When switching AMD over to the new topology parser then the match functions
need to look for AMD systems with the extended topology feature at the new
topo.amd_node_id member which is then holding the node id information.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.082979150@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
f7fb3b2dd9 x86/cpu: Provide an AMD/HYGON specific topology parser
AMD/HYGON use various methods for topology evaluation:

  - Leaf 0x80000008 and 0x8000001e based, with an optional leaf 0xb,
    which is the preferred variant for modern CPUs.

    Leaf 0xb will soon be superseded by leaf 0x80000026, which is just
    another variant of the Intel 0x1f leaf for whatever reason.

  - Leaf 0x80000008 and NODEID_MSR based

  - Legacy fallback

That code follows the principle of random bits and pieces all over the
place, which results in multiple evaluations and impenetrable code flows,
in the same way the Intel parsing did.

Provide a sane implementation by clearly separating the three variants and
bringing them in the proper preference order in one place.

This provides the parsing for both AMD and HYGON because there is no point
in having a separate HYGON parser which only differs by 3 lines of
code. Any further divergence between AMD and HYGON can be handled in
different functions, while still sharing the existing parsers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.020038641@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
7e3ec62867 x86/cpu/amd: Provide a separate accessor for Node ID
AMD (ab)uses topology_die_id() to store the Node ID information and
topology_max_die_per_pkg() to store the number of nodes per package.

This collides with the proper processor die level enumeration which is
coming on AMD with CPUID 8000_0026, unless there is a correlation between
the two. There is zero documentation about that.

So provide new storage and new accessors which for now still access die_id
and topology_max_die_per_pkg(). They will be mopped up after AMD and HYGON
are converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
2024-02-15 22:07:37 +01:00
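The temporary accessor described above, as a sketch (the name is suffixed
here to mark it as illustrative):

  /* New accessor, still backed by die_id until the AMD and HYGON
   * parsers are converted over. */
  static inline u32 topology_amd_node_id_sketch(struct cpuinfo_x86 *c)
  {
          return c->topo.die_id;
  }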
Thomas Gleixner
22d63660c3 x86/cpu: Use common topology code for Intel
Intel CPUs use either topology leaf 0xb/0x1f evaluation or the legacy
SMP/HT evaluation based on CPUID leaf 0x1/0x4.

Move it over to the consolidated topology code and remove the random
topology hacks which are sprinkled into the Intel and the common code.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.893644349@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
3d41009425 x86/cpu: Provide a sane leaf 0xb/0x1f parser
detect_extended_topology(), along with its early() variant, is a classic
example of duct-tape engineering:

  - It evaluates an array of subleafs with a boatload of local variables
    for the relevant topology levels instead of using an array to save the
    enumerated information and propagate it to the right level

  - It has no boundary checks for subleafs

  - It prevents updating the die_id with a crude workaround instead of
    checking for leaf 0xb, which does not provide die information.

  - Its evaluation of the number of dies is broken, as it uses:

      num_processors[DIE_LEVEL] / num_processors[CORE_LEVEL]

    which "works" correctly only if none of the intermediate topology
    levels (MODULE/TILE) are enumerated.

There is zero value in trying to "fix" that code as the only proper fix is
to rewrite it from scratch.

Implement a sane parser with proper code documentation, which will be used
for the consolidated topology evaluation in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.830571770@linutronix.de
2024-02-15 22:07:37 +01:00
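For reference, a bounded leaf 0xb subleaf walk as a standalone userspace C
sketch (gcc/clang on x86); the kernel parser additionally records each
level in an array instead of a pile of local variables:

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          /* Walk leaf 0xb subleafs with an explicit bound instead of
           * trusting the CPU to terminate the enumeration. */
          for (unsigned int subleaf = 0; subleaf < 8; subleaf++) {
                  if (!__get_cpuid_count(0x0b, subleaf, &eax, &ebx, &ecx, &edx))
                          break;

                  unsigned int level_type = (ecx >> 8) & 0xff;
                  if (!level_type)
                          break;  /* an invalid level terminates the list */

                  printf("subleaf %u: type %u shift %u count %u\n",
                         subleaf, level_type, eax & 0x1f, ebx & 0xffff);
          }
          return 0;
  }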
Thomas Gleixner
92853a7774 x86/cpu: Move __max_die_per_package to common.c
In preparation for a complete replacement of the topology leaf 0xb/0x1f
evaluation, move __max_die_per_package into the common code.

Will be removed once everything is converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.768188958@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
598e719c40 x86/cpu: Use common topology code for Centaur and Zhaoxin
Centaur and Zhaoxin CPUs use only the legacy SMP detection. Remove the
invocations from their 32-bit path and exclude them from the 64-bit call
path.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.706794189@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
bda74aae20 x86/cpu: Add legacy topology parser
The legacy topology detection via CPUID leaf 4, which provides the number
of cores in the package, and CPUID leaf 1, which provides the number of
logical CPUs when FEATURE_HT is enabled and the CMP_LEGACY feature is not
set, is shared by Intel, Centaur and Zhaoxin CPUs.

Lift the code from common.c without the early detection hack and provide it
as common fallback mechanism.

Will be utilized in later changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.644448852@linutronix.de
2024-02-15 22:07:37 +01:00
Thomas Gleixner
ebdb203610 x86/cpu: Provide cpu_init/parse_topology()
Topology evaluation is a complete disaster and an impenetrable mess. It's
scattered all over the place with some vendor implementations doing early
evaluation and some not. The most horrific part is the permanent
overwriting of smp_num_siblings and __max_die_per_package, instead of
establishing them once on the boot CPU and validating the result on the
APs.

The goals are:

  - One topology evaluation entry point

  - Proper sharing of pointlessly duplicated code

  - Proper structuring of the evaluation logic and preferences.

  - Evaluating important system wide information only once on the boot CPU

  - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
    the shortcomings of leaf 0x1f evaluation.

Start to consolidate the topology evaluation code by providing the entry
points for the early boot CPU evaluation and for the final parsing on the
boot CPU and the APs.

Move the trivial pieces into that new code:

   - The initialization of cpuinfo_x86::topo

   - The evaluation of CPUID leaf 1, which presets topo::initial_apicid

   - topo_apicid is set to topo::initial_apicid when invoked from early
     boot. When invoked for the final evaluation on the boot CPU it reads
     the actual APIC ID, which makes apic_get_initial_apicid() obsolete
     once everything is converted over.

Provide a temporary helper function topo_converted() which shields off the
not yet converted CPU vendors from invoking code which would break them.
This shielding covers all vendor CPUs which support SMP, but not the
historical pure UP ones as they only need the topology info init and
eventually the initial APIC initialization.

Provide two new members in cpuinfo_x86::topo to store the maximum number of
SMT siblings and the number of dies per package and add them to the debugfs
readout. These two members will be used to populate this information on the
boot CPU and to validate the APs against it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240212153624.581436579@linutronix.de
2024-02-15 22:07:36 +01:00
Thomas Gleixner
43d86e3cd9 x86/cpu: Provide cpuid_read() et al.
Provide a few helper functions to read CPUID leafs or individual registers
into a data structure without requiring unions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/878r3mg570.ffs@tglx
2024-02-15 22:07:36 +01:00
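The idea, sketched (the helper name and struct layout are assumptions; the
actual kernel helpers may differ):

  /* Read one CPUID subleaf into a plain struct; no unions required. */
  struct cpuid_regs_sketch {
          u32 eax, ebx, ecx, edx;
  };

  static inline void cpuid_read_sketch(u32 leaf, u32 subleaf,
                                       struct cpuid_regs_sketch *regs)
  {
          cpuid_count(leaf, subleaf,
                      &regs->eax, &regs->ebx, &regs->ecx, &regs->edx);
  }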
Masahiro Yamada
3b9ab248bc kbuild: use 4-space indentation when followed by conditionals
The GNU Make manual [1] clearly forbids a tab at the beginning of the
conditional directive line:
 "Extra spaces are allowed and ignored at the beginning of the
  conditional directive line, but a tab is not allowed."

This will not work for the next release of GNU Make, hence commit
82175d1f94 ("kbuild: Replace tabs with spaces when followed by
conditionals") replaced the inappropriate tabs with 8 spaces.

However, the 8-space indentation cannot be visually distinguished.
Linus suggested 2-4 spaces for those nested if-statements. [2]

This commit redoes the replacement with 4 spaces.

[1]: https://www.gnu.org/software/make/manual/make.html#Conditional-Syntax
[2]: https://lore.kernel.org/all/CAHk-=whJKZNZWsa-VNDKafS_VfY4a5dAjG-r8BZgWk_a-xSepw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-15 06:05:44 +09:00
Paolo Bonzini
2f8ebe43a0 KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:
- Remove redundant newlines from error messages.
 
  - Delete an unused variable in the AMX test (which causes build failures when
    compiling with -Werror).
 
  - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
    error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
    and the test eventually got skipped).
 
  - Fix TSC related bugs in several Hyper-V selftests.
 
  - Fix a bug in the dirty ring logging test where a sem_post() could be left
    pending across multiple runs, resulting in incorrect synchronization between
    the main thread and the vCPU worker thread.
 
  - Relax the dirty log split test's assertions on 4KiB mappings to fix false
    positives due to the number of mappings for memslot 0 (used for code and
    data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.
 
  - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
    function generates boolean values, and all callers treat the return value as
    a bool).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKupQSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5DiQP/RNSgLrE9+/3oyqo9zpbhio2dKqz4dIk
 8Ga1ZE4R89dyMB9jGKtWn3rEkyma3TsB+neVpG9ohHV6j25JJ0vNAkxQu3Gt+gkl
 uM1lh/IfXPnAKyuy6dW9tpgZYE1v2/KfdWjeEzzxfPjzY/LX3yFiiCKEnUmfjjzZ
 sSz91nV4KYS4b4xLWTIcBgNJuyLJuL05htTLmCu7t8DKOBHwHxXjSn8qqG8OvAjs
 FOhf0zgGJKBFdKOw2Y8XeDdKO0RTEyEPHaFILcLEsuhoVIbY5OUmLe32pAFzzMbG
 hPawUZ5CzC++e339gUgGkRNY80iSnGcYVcZa+ohxOsNBdOWko9z/eGWZUV7qkYDK
 dkPHMoDnSzUCE2eSYbEB1eR/KOfziJCWMS9SAIJbJxIGb1HYajikwAEZ6FNp3R+u
 MyCuNlV9TfsGgt4Dx8RctMeH2ROpORRu7h3WPFUBgG2/jOzPk/OR6U8hSzvmhTvL
 MykZ8IaLmUIYoK/nCY2iwy50lQRxtZ/htqWn3sidCBGY0DXdNlMhvd3Vk9jtUvY5
 Fgof0b564eYfk/qO3cMIDd2WFaDejP28JVSn0CNm6z9i54ubCKkSBEb4kTYXXnVK
 YBHvbZ21Vjg52trudvK5UPt599sxxNBNiSV32ckLFKHS4ZVGSFSBSbsAWiQF157i
 CbYntmtJhM+D
 =infW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:

 - Remove redundant newlines from error messages.

 - Delete an unused variable in the AMX test (which causes build failures when
   compiling with -Werror).

 - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
   error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
   and the test eventually got skipped).

 - Fix TSC related bugs in several Hyper-V selftests.

 - Fix a bug in the dirty ring logging test where a sem_post() could be left
   pending across multiple runs, resulting in incorrect synchronization between
   the main thread and the vCPU worker thread.

 - Relax the dirty log split test's assertions on 4KiB mappings to fix false
   positives due to the number of mappings for memslot 0 (used for code and
   data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.

 - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
   function generates boolean values, and all callers treat the return value as
   a bool).
2024-02-14 12:34:58 -05:00
Paolo Bonzini
22d0bc0721 KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
    if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
    the UNINITIALIZED state, the spurious request will result in KVM exiting to
    userspace, which in turn causes QEMU to constantly acquire and release
    QEMU's global mutex, to the point where the BSP is unable to make forward
    progress.
 
  - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
    incorrectly truncated, and ultimately causes KVM to think a fixed counter
    has already been disabled (KVM thinks the old value is '0').
 
  - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
    that is ultimately ignored due to ignore_msrs=true doesn't zero the output
    as intended.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKt90SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5e5wP/jU3Zuul2e7fb4E6RN/GPhAFSTzG7Cwe
 4lVSSSPmOQsEXTKwCOMj7fgwF9qVSLzLRi62MKziTJY/1FDsTcI3xlM7nM2wwQC2
 26evIzI3qB54rHQdviuh1jwh6scZH7xLw7kANE+8x4skkm6AZB1IUnj3utR3fEPj
 mIUA5kGQxEAEDrn0TFzrRgIw4JngKjrCwmpT+vbmR37flC+Rwv8jr4JY1E3cBAT3
 KEilv3Fg07gbvagWGZNSSUNqQos5MsnLifdryKbA/vuIJf+j/01CMo5KtLKshiaX
 t4gXPldVZDXdxjH6im0wRAX4s/FpZg3vVje2OxPbzwMVb5+XvLewzjzagQ1lFA3I
 gsNXF8uGdYn0fb8T/wQG4ulWBw6A844PSmGONCwLDA+GZuL9xjMIK5d1litvb/im
 bEP1Ahv6UcnDNKHqRzuFXQENiS2uQdJNLs7p291oDNkTm/CGjDUgFXPuaCehWrUf
 ZZf1dxmIPM/Xt2j19mS/HnTHD114A8t1GTx799kBXbG4x0ScVQclkhRk6yFG3ObA
 14uXxxAdEBoZGBJ2yr5FbddvRLswbWugFoxKbtCZ/CHMopOUQcRRmRb7Lm1NHLtg
 Ae/sHO6gQ1xcrbwpMCq+6RjFK57yW+n1TB8ZTmAE2RQynGqzReSTlUNtfn3yMg4v
 hz+2zGzezoeN
 =92ae
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.8:

 - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
   if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
   the UNINITIALIZED state, the spurious request will result in KVM exiting to
   userspace, which in turn causes QEMU to constantly acquire and release
   QEMU's global mutex, to the point where the BSP is unable to make forward
   progress.

 - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
   incorrectly truncated, and ultimately causes KVM to think a fixed counter
   has already been disabled (KVM thinks the old value is '0').

 - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
   that is ultimately ignored due to ignore_msrs=true doesn't zero the output
   as intended.
2024-02-14 12:34:43 -05:00
Ingo Molnar
4589f199eb Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:49:37 +01:00
Ingo Molnar
03c11eb3b1 Linux 6.8-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmXJK4UeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGHsYH/jKmzKXDRsBCcw/Q
 HGUvFtpohWBOpN6efdf0nxilQisuyQrqKB9fnwvfcdE60VpqMJXFMdlFh/fonxPl
 JMbpk9y5uw48IJZA43NwTxUrjZ4wyWzv4ZF6YWa+5WdTAJpPLEPhhnLxcHOKklMr
 5Cm/7B/M7eB2BXBfc45b1pkKN22q9OXvjaKxZ+5wYmiMxS+GC8l8jiJ/WlHX78PR
 eLgsa1v732f2D7YF75wVhaoYepR+QzA9wTKqhjMNCEaVc2PQhA2JRsBXEt84qEIa
 FZigmf7LLc4ed9YA2XjRBZhAehe3cZVJZ1lasW37IATS921La2WfKuiysICJOtyT
 bGjK8tk=
 =Pt7W
 -----END PGP SIGNATURE-----

Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch

Conflicts:
	arch/x86/include/asm/percpu.h
	arch/x86/include/asm/text-patching.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:45:07 +01:00
Randy Dunlap
7d4002e8ce x86/insn-eval: Fix function param name in get_eff_addr_sib()
Change "regoff" to "base_offset" in 2 places in the kernel-doc comments to
prevent warnings:

  insn-eval.c:1152: warning: Function parameter or member 'base_offset' not described in 'get_eff_addr_sib'
  insn-eval.c:1152: warning: Excess function parameter 'regoff' description in 'get_eff_addr_sib'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240211062452.16411-1-rdunlap@infradead.org
2024-02-13 22:41:25 +01:00
Steve Wahl
d794734c9b x86/mm/ident_map: Use gbpages only where a full GB page should be mapped.
When ident_pud_init() uses only gbpages to create identity maps, large
ranges of addresses not actually requested can be included in the
resulting table; a 4K request will map a full GB.  On UV systems, this
ends up including regions that will cause hardware to halt the system
if accessed (these are marked "reserved" by BIOS).  Even processor
speculation into these regions is enough to trigger the system halt.

Only use gbpages when map creation requests include the full GB page
of space.  Fall back to using smaller 2M pages when only portions of a
GB page are included in the request.

No attempt is made to coalesce mapping requests. If a request requires
a map entry at the 2M (pmd) level, subsequent mapping requests within
the same 1G region will also be at the pmd level, even if adjacent or
overlapping such requests could have been combined to map a full
gbpage.  Existing usage starts with larger regions and then adds
smaller regions, so this should not have any great consequence.

[ dhansen: fix up comment formatting, simplify changelog ]

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
2024-02-12 14:53:42 -08:00
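The heart of the fix is a size and alignment check before choosing the 1G
mapping; roughly (simplified from the actual ident_pud_init() logic):

  /* Use a 1G mapping only when the request covers the whole naturally
   * aligned 1G range; otherwise populate 2M (pmd) entries instead. */
  if (IS_ALIGNED(addr, PUD_SIZE) && next - addr == PUD_SIZE) {
          set_pud(pud, __pud(addr | pgprot_val(PAGE_KERNEL_LARGE)));
          continue;
  }
  /* ... fall back to ident_pmd_init() for the partial range ... */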
Kunwu Chan
3693bb4465 x86/xen: Add some null pointer checking to smp.c
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure. Ensure the allocation was successful
by checking the pointer validity.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401161119.iof6BQsf-lkp@intel.com/
Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240119094948.275390-1-chentao@kylinos.cn
Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-12 20:14:52 +01:00
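The pattern being enforced, sketched with one of the per-CPU IRQ names
from that file:

  char *name = kasprintf(GFP_KERNEL, "resched%d", cpu);
  if (!name)
          return -ENOMEM;  /* kasprintf() can fail; check before use */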
Josh Poimboeuf
4461438a84 x86/retpoline: Ensure default return thunk isn't used at runtime
Make sure the default return thunk is not used after all return
instructions have been patched by the alternatives because the default
return thunk is insufficient when it comes to mitigating Retbleed or
SRSO.

Fix based on an earlier version by David Kaplan <david.kaplan@amd.com>.

  [ bp: Fix the compilation error of warn_thunk_thunk being an invisible
        symbol, hoist thunk macro into calling.h ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010171020.462211-4-david.kaplan@amd.com
Link: https://lore.kernel.org/r/20240104132446.GEZZaxnrIgIyat0pqf@fat_crate.local
2024-02-12 11:42:15 +01:00
Linus Torvalds
c021e191cf - Correct the minimum CPU family for Transmeta Crusoe in Kconfig so that
such hw can boot again
 
 - Do not take into account XSTATE buffer size info supplied by userspace
   when constructing a sigreturn frame
 
 - Switch get_/put_user* to EX_TYPE_UACCESS exception handling when an
   MCE is encountered so that it can be properly recovered from instead
   of simply panicking
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXIo3cACgkQEsHwGGHe
 VUpnvg//THpQodOkgc8SLMut0fx/qcmWTZAxXKBPQklZkBq3sbA6wEDQqvBNkXfl
 ovSss8TeL0KRrq3OsurJK+QXP94+nFt11q9SEhqPmhGb9d4H7aBimCrNjP0yEE1f
 YuvkhGhylIPnrwYoJUrK024tuxkFFgIVqr+adv1PrvtohnpVhICJY2oTpxtpQDZi
 r+k7P7VBG1oNvYETAbljbTQr5KV84YTmZa899/tncZaZbE+18bK/VJhL728ztSzD
 Xdwoztrf37fqYk03l40MJwJwpiAC5t2g/qwa5yvHjr9Eavb5YeLX34nxeG2AdOpx
 GTwrWkIW1dY4ck3lC4HR/igd2bDB4ZEfxJMMLkQAIvurGpQjU/jVXC28V4r6N5MW
 UF1gf4i9m2/BrpX+wpDOi11tl5RQQcV7Y8qsMN1lqRM5sDjjh4PV9oT2TXKmuYn6
 2T4Xv0A94FROFkQ9F52MFqTcwh0Yu9vtGsmtbCRP/em5OwqyyVFHWdEFR4PSZUpU
 89V7zVFlLWTEuPjrUAU9sQmTL56gNlVmejWAzearhHgeFKUs0EK1hcn310454aVm
 CzDN+4u8uCHFDKsF915nQnRI6jpRnf3mC4xWYheHcoCg02iSImWwVGGVHbJrWSNV
 fFYxwWtpFw0N9jzCfUHnElp3jN1Ll1LkkWQC4NvCtZxeUioqKJI=
 =b7B7
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Correct the minimum CPU family for Transmeta Crusoe in Kconfig so
   that such hw can boot again

 - Do not take into account XSTATE buffer size info supplied by userspace
   when constructing a sigreturn frame

 - Switch get_/put_user* to EX_TYPE_UACCESS exception handling when an
   MCE is encountered so that it can be properly recovered from instead
   of simply panicking

* tag 'x86_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
  x86/fpu: Stop relying on userspace for info to fault in xsave buffer
  x86/lib: Revert to _ASM_EXTABLE_UA() for {get,put}_user() fixups
2024-02-11 11:41:51 -08:00
Linus Torvalds
4356e9f841 work around gcc bugs with 'asm goto' with outputs
We've had issues with gcc and 'asm goto' before, and we created an
'asm_volatile_goto()' macro for that in the past: see commits
3f0116c323 ("compiler/gcc4: Add quirk for 'asm goto' miscompilation
bug") and a9f180345f ("compiler/gcc4: Make quirk for
asm_volatile_goto() unconditional").

Then, much later, we ended up removing the workaround in commit
43c249ea0b ("compiler-gcc.h: remove ancient workaround for gcc PR
58670") because we no longer supported building the kernel with the
affected gcc versions, but we left the macro uses around.

Now, Sean Christopherson reports a new version of a very similar
problem, which is fixed by re-applying that ancient workaround.  But the
problem in question is limited to only the 'asm goto with outputs'
cases, so instead of re-introducing the old workaround as-is, let's
rename and limit the workaround to just that much less common case.

It looks like there are at least two separate issues that all hit in
this area:

 (a) some versions of gcc don't mark the asm goto as 'volatile' when it
     has outputs:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420

     which is easy to work around by just adding the 'volatile' by hand.

 (b) Internal compiler errors:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422

     which are worked around by adding the extra empty 'asm' as a
     barrier, as in the original workaround.

But the problem Sean sees may be a third thing, since it involves bad
code generation (not an ICE) even with the manually added 'volatile'.

The same old workaround works for this case too, even if this feels a
bit like voodoo programming and may only be hiding the issue.

Reported-and-tested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-09 15:57:48 -08:00
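The shape of the reintroduced workaround, now limited to the outputs case;
roughly as it reads for gcc builds (clang keeps a plain passthrough):

  /* An empty asm statement after the 'asm goto' acts as a barrier
   * which avoids the gcc miscompilation/ICE; it is only needed when
   * output operands are present. */
  #define asm_goto_output(x...) \
          do { asm volatile goto(x); asm (""); } while (0)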
Linus Torvalds
e6f39a90de EFI fixes for v6.8 #1
- Tighten ELF relocation checks on the RISC-V EFI stub
 - Give up if the new EFI memory attributes protocol fails spuriously on
   x86
 - Take care not to place the kernel in the lowest 16 MB of DRAM on x86
 - Omit special purpose EFI memory from memblock
 - Some fixes for the CXL CPER reporting code
 - Make the PE/COFF layout of mixed-mode capable images comply with a
   strict interpretation of the spec
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZcDtKAAKCRAwbglWLn0t
 XMDfAP9ttq8Ir4+hp8A0DGE79x6eSgBIkl5ztGmMQGybzEkzdAEAgxfDUieQW4TT
 GmbyGGUouvSYxfZf4gVTQn8b/bd57AI=
 =Af8A
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:
 "The only notable change here is the patch that changes the way we deal
  with spurious errors from the EFI memory attribute protocol. This will
  be backported to v6.6, and is intended to ensure that we will not
  paint ourselves into a corner when we tighten this further in order to
  comply with MS requirements on signed EFI code.

  Note that this protocol does not currently exist in x86 production
  systems in the field, only in Microsoft's fork of OVMF, but it will be
  mandatory for Windows logo certification for x86 PCs in the future.

   - Tighten ELF relocation checks on the RISC-V EFI stub

   - Give up if the new EFI memory attributes protocol fails spuriously
     on x86

   - Take care not to place the kernel in the lowest 16 MB of DRAM on
     x86

   - Omit special purpose EFI memory from memblock

   - Some fixes for the CXL CPER reporting code

   - Make the PE/COFF layout of mixed-mode capable images comply with a
     strict interpretation of the spec"

* tag 'efi-fixes-for-v6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  x86/efistub: Use 1:1 file:memory mapping for PE/COFF .compat section
  cxl/trace: Remove unnecessary memcpy's
  cxl/cper: Fix errant CPER prints for CXL events
  efi: Don't add memblocks for soft-reserved memory
  efi: runtime: Fix potential overflow of soft-reserved region size
  efi/libstub: Add one kernel-doc comment
  x86/efistub: Avoid placing the kernel below LOAD_PHYSICAL_ADDR
  x86/efistub: Give up if memory attribute protocol returns an error
  riscv/efistub: Tighten ELF relocation check
  riscv/efistub: Ensure GP-relative addressing is not used
2024-02-09 10:40:50 -08:00
Jamie Cunliffe
f82811e22b rust: Refactor the build target to allow the use of builtin targets
Eventually we want all architectures to be using the target as defined
by rustc. However currently some architectures can't do that and are
using the target.json specification. This puts in place the foundation
to allow the use of the builtin target definition or a target.json
specification.

Signed-off-by: Jamie Cunliffe <Jamie.Cunliffe@arm.com>
Acked-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20231020155056.3495121-2-Jamie.Cunliffe@arm.com
[catalin.marinas@arm.com: squashed loongarch ifneq fix from WANG Rui]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-09 16:11:07 +00:00
Aleksander Mazur
f6a1892585 x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
The kernel built with MCRUSOE is unbootable on Transmeta Crusoe.  It shows
the following error message:

  This kernel requires an i686 CPU, but only detected an i586 CPU.
  Unable to boot - please use a kernel appropriate for your CPU.

Remove MCRUSOE from the condition introduced by the commit referenced in
the Fixes: tag, effectively changing X86_MINIMUM_CPU_FAMILY back to 5 on
that machine, which matches the CPU family given by CPUID.

  [ bp: Massage commit message. ]

Fixes: 25d76ac888 ("x86/Kconfig: Explicitly enumerate i686-class CPUs in Kconfig")
Signed-off-by: Aleksander Mazur <deweloper@wp.pl>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240123134309.1117782-1-deweloper@wp.pl
2024-02-09 16:28:19 +01:00
Jakub Kicinski
3be042cf46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/stmicro/stmmac/common.h
  38cc3c6dcc ("net: stmmac: protect updates of 64-bit statistics counters")
  fd5a6a7131 ("net: stmmac: est: Per Tx-queue error count for HLBF")
  c5c3e1bfc9 ("net: stmmac: Offload queueMaxSDU from tc-taprio")

drivers/net/wireless/microchip/wilc1000/netdev.c
  c901388028 ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000")
  328efda22a ("wifi: wilc1000: do not realloc workqueue everytime an interface is added")

net/unix/garbage.c
  11498715f2 ("af_unix: Remove io_uring code for GC.")
  1279f9d9de ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-08 15:30:33 -08:00
Paolo Bonzini
687d8f4c3d Merge branch 'kvm-kconfig'
Cleanups to Kconfig definitions for KVM

* replace HAVE_KVM with an architecture-dependent symbol, when CONFIG_KVM
  may or may not be available depending on CPU capabilities (MIPS)

* replace HAVE_KVM with IS_ENABLED(CONFIG_KVM) for host-side code that is
  not part of the KVM module, so that it is completely compiled out

* factor common "select" statements in common code instead of requiring
  each architecture to specify it
2024-02-08 08:47:51 -05:00
Paolo Bonzini
f48212ee8e treewide: remove CONFIG_HAVE_KVM
It has no users anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:36 -05:00
Paolo Bonzini
dcf0926e9b x86: replace CONFIG_HAVE_KVM with IS_ENABLED(CONFIG_KVM)
It is more accurate to check if KVM is enabled, instead of having the
architecture say so.  Architectures always "have" KVM, so for example
checking CONFIG_HAVE_KVM in x86 code is pointless, but if KVM is disabled
in a specific build, there is no need for support code.

Alternatively, many of the #ifdefs could simply be deleted.  However,
this would add completely dead code.  For example, when KVM is disabled,
there should not be any posted interrupts, i.e. NOT wiring up the "dummy"
handlers and treating IRQs on those vectors as spurious is the right
thing to do.

Cc: x86@kernel.org
Cc: kbingham@kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:35 -05:00
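The mechanical shape of the replacement (the handler name is an
illustrative placeholder):

  /* Before: a per-architecture opt-in symbol. */
  #ifdef CONFIG_HAVE_KVM
          install_posted_interrupt_handlers();
  #endif

  /* After: tracks whether KVM is actually built, as built-in or module. */
  #if IS_ENABLED(CONFIG_KVM)
          install_posted_interrupt_handlers();
  #endif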
Paolo Bonzini
61df71ee99 kvm: move "select IRQ_BYPASS_MANAGER" to common code
CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by
CONFIG_HAVE_KVM_IRQ_BYPASS.  There is no advantage in adding the corresponding
"select" directive to each architecture.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:34 -05:00
Paolo Bonzini
6bda055d62 KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally
Since all architectures (for historical reasons) have to define
struct kvm_guest_debug_arch, and since userspace has to check
KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is
no advantage in masking the capability #define itself.  Remove
the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific
headers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:06 -05:00
Paolo Bonzini
8886640dad kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
unused definitions in include/uapi/linux/kvm.h.  __KVM_HAVE_READONLY_MEM however
was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
return nonzero.  This however does not make sense, and it prevented userspace
from supporting this architecture-independent feature without recompilation.

Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
is only used in virt/kvm/kvm_main.c.  Userspace does not need to test it
and there should be no need for it to exist.  Remove it and replace it
with a Kconfig symbol within Linux source code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:06 -05:00
Paolo Bonzini
bcac047727 KVM: x86: move x86-specific structs to uapi/asm/kvm.h
Several capabilities that exist only on x86 nevertheless have their
structs defined in include/uapi/linux/kvm.h.  Move them to
arch/x86/include/uapi/asm/kvm.h for cleanliness.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:04 -05:00
Paolo Bonzini
458822416a kvm: x86: use a uapi-friendly macro for GENMASK
Change uapi header uses of GENMASK to instead use the uapi/linux/bits.h bit
macros, since GENMASK is not defined in uapi headers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:04 -05:00
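The uapi-safe spelling, sketched (the mask name is illustrative):

  #include <linux/bits.h>

  /* GENMASK() lives in kernel-internal headers; uapi code uses the
   * double-underscore variant from uapi/linux/bits.h. */
  #define KVM_EXAMPLE_ID_MASK __GENMASK(7, 0)  /* bits 7..0 */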
Dionna Glaze
882dd4aee3 kvm: x86: use a uapi-friendly macro for BIT
Change uapi header uses of BIT to instead use the uapi/linux/const.h bit
macros, since BIT is not defined in uapi headers.

The PMU mask uses _BITUL since it targets a 32-bit flag field, whereas
the longmode definition is meant for a 64-bit flag field.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
Message-Id: <20231207001142.3617856-1-dionnaglaze@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:03 -05:00
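The equivalent substitution for BIT(), sketched (the flag names are
illustrative):

  #include <linux/const.h>

  #define KVM_EXAMPLE_PMU_FLAG      _BITUL(2)    /* 32-bit flag field */
  #define KVM_EXAMPLE_LONGMODE_FLAG _BITULL(35)  /* 64-bit flag field */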
Masahiro Yamada
289d0a475c x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
In arch/x86/Kconfig, COMPAT_32 is defined as (IA32_EMULATION || X86_32).
Use it to eliminate redundancy in Makefile.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231121235701.239606-5-masahiroy@kernel.org
2024-02-08 13:23:14 +01:00
Masahiro Yamada
ac9275b3b4 x86/vdso: Use $(addprefix ) instead of $(foreach )
$(addprefix ) is slightly shorter and more intuitive.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231121235701.239606-4-masahiroy@kernel.org
2024-02-08 13:23:01 +01:00
Masahiro Yamada
329b77b59f x86/vdso: Simplify obj-y addition
Add objects to obj-y in a more straightforward way.

CONFIG_X86_32 and CONFIG_IA32_EMULATION are not enabled simultaneously,
but even if they are, Kbuild graciously deduplicates obj-y entries.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231121235701.239606-3-masahiroy@kernel.org
2024-02-08 13:22:08 +01:00
Masahiro Yamada
31a4ebee0d x86/vdso: Consolidate targets and clean-files
'targets' and 'clean-files' do not need to list the same files because
the files listed in 'targets' are cleaned up.

Refactor the code.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231121235701.239606-2-masahiroy@kernel.org
2024-02-08 13:15:41 +01:00
Julian Stecklina
64435aaa4a KVM: x86: rename push to emulate_push for consistency
push and emulate_pop are counterparts. Rename push to emulate_push and
harmonize its function signature with emulate_pop. This should remove
a bit of cognitive load when reading this code.
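
The harmonized prototypes end up roughly as follows (a sketch, not the
verbatim emulator code):

  static int emulate_pop(struct x86_emulate_ctxt *ctxt, void *dest, int len);
  static void emulate_push(struct x86_emulate_ctxt *ctxt, void *data, int len);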

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 16:10:01 -08:00
Julian Stecklina
6fd1e3963f KVM: x86: Clean up partially uninitialized integer in emulate_pop()
Explicitly zero out variables passed to emulate_pop() as output params
to harden against consuming uninitialized data, and to make sanitizers
happy.  Many flows that use emulate_pop() pass an "unsigned long" so as
to be able to hold the largest possible operand, but the actual number
of bytes written is usually the word width of the vCPU.  E.g. if the vCPU
is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the
output param will be uninitialized.

Passing around the uninitialized data is benign, as actual KVM usage of
the output is also tied to the word width, but passing around
uninitialized data makes some sanitizers rightly complain.

Note, initializing the data in emulate_pop() is not a safe alternative,
e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were
emulated with a 16-bit stack.
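
A sketch of the hardened caller pattern:

  unsigned long val = 0;  /* zero the bytes emulate_pop() may not write */
  rc = emulate_pop(ctxt, &val, ctxt->op_bytes);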

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de
[sean: massage changelog, drop em_popa() variable size change]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 16:08:54 -08:00
Thomas Prescher
03f6298c7c KVM: x86/emulator: emulate movbe with operand-size prefix
The MOVBE instruction can come with an operand-size prefix (66h). In
this case, the x86 emulation code returns EMULATION_FAILED.

It turns out that em_movbe can already handle this case and all that
is missing is an entry in the respective opcode tables to populate
gprefix->pfx_66.
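
For reference, the prefix-dispatch structure in the emulator looks like
this (a sketch):

  struct gprefix {
          struct opcode pfx_no;   /* no prefix                 */
          struct opcode pfx_66;   /* operand-size prefix (66h) */
          struct opcode pfx_f2;   /* REPNE prefix (F2h)        */
          struct opcode pfx_f3;   /* REP prefix (F3h)          */
  };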

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 13:16:11 -08:00
Linus Torvalds
5c24ba2055 x86 guest:
* Avoid false positive for check that only matters on AMD processors
 
 x86:
 
 * Give a hint when Win2016 might fail to boot due to XSAVES && !XSAVEC configuration
 
 * Do not allow creating an in-kernel PIT unless an IOAPIC already exists
 
 RISC-V:
 
 * Allow ISA extensions that were enabled for bare metal in 6.8
   (Zbc, scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa)
 
 S390:
 
 * fix CC for successful PQAP instruction
 
 * fix a race when creating a shadow page
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmXB9EIUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNF6Qf/VbNzzntY2BBNL6ZReqH+7GqMCMo7
 Q8OYsP+B7TWc0C84JNBTmvC5lwY0FmXEV+i9XFUnyMt/eEHEfr/rko1McRf+byAM
 vcfbTAz8t24bFSfojg7QJGM+pfUTrqjGmWqHwke/DuARsGB8Zntgtb50m966+xso
 kDtcsrfGOlpHbnnWZQLLQKJ6tVv7Z2/clFlf4gCT/Quex4Jo76Uq08MA9BFS9iw1
 e1oftwuXe6pCUcyt1M/AwOe8FnkP+Xm8oVmW0eJgO0TVDwob0Msx2LpVS2N/+/Oj
 1mtBSz4rUQyDdI1j6D0+HkdAlNnwEWSV6eQb+qtjXbhIWBOHUpFXNpQWkg==
 =LVAr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 guest:

   - Avoid false positive for check that only matters on AMD processors

  x86:

   - Give a hint when Win2016 might fail to boot due to XSAVES &&
     !XSAVEC configuration

   - Do not allow creating an in-kernel PIT unless an IOAPIC already
     exists

  RISC-V:

   - Allow ISA extensions that were enabled for bare metal in 6.8 (Zbc,
     scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa)

  S390:

   - fix CC for successful PQAP instruction

   - fix a race when creating a shadow page"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORM
  x86/kvm: Fix SEV check in sev_map_percpu_data()
  KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
  KVM: x86: Check irqchip mode before create PIT
  KVM: riscv: selftests: Add Zfa extension to get-reg-list test
  RISC-V: KVM: Allow Zfa extension for Guest/VM
  KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test
  RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM
  KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test
  RISC-V: KVM: Allow Zihintntl extension for Guest/VM
  KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test
  RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM
  KVM: riscv: selftests: Add vector crypto extensions to get-reg-list test
  RISC-V: KVM: Allow vector crypto extensions for Guest/VM
  KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list test
  RISC-V: KVM: Allow scalar crypto extensions for Guest/VM
  KVM: riscv: selftests: Add Zbc extension to get-reg-list test
  RISC-V: KVM: Allow Zbc extension for Guest/VM
  KVM: s390: fix cc for successful PQAP
  KVM: s390: vsie: fix race during shadow creation
2024-02-07 17:52:16 +00:00
Peter Hilber
27f6a9c87a kvmclock: Unexport kvmclock clocksource
The KVM PTP driver now refers to the clocksource ID CSID_X86_KVM_CLK, not
to the clocksource itself any more. There are no remaining users of the
clocksource export.

Therefore, make the clocksource static again.

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-9-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Peter Hilber
b152688c91 treewide: Remove system_counterval_t.cs, which is never read
The clocksource pointer in struct system_counterval_t is not evaluated any
more. Remove the code setting the member, and the member itself.
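
After the change, the struct reduces to (a sketch):

  struct system_counterval_t {
          u64                   cycles;  /* raw counter value        */
          enum clocksource_ids  cs_id;   /* replaces the .cs pointer */
  };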

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-8-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Peter Hilber
576bd4962f x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
Add a clocksource ID for the x86 kvmclock.

Also, for ptp_kvm, set the recently added struct system_counterval_t member
cs_id to the clocksource ID (x86 kvmclock or ARM Generic Timer). In the
future, get_device_system_crosststamp() will compare the clocksource ID in
struct system_counterval_t, rather than the clocksource.

For now, to avoid touching too many subsystems at once, extract the
clocksource ID from the clocksource. The clocksource dereference will be
removed once everything is converted over.
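
A sketch of how a cross-timestamp result is filled in now
(CSID_X86_KVM_CLK as named above; the surrounding code is illustrative):

  system_counter->cycles = cycle;
  system_counter->cs_id  = CSID_X86_KVM_CLK;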

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-5-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Peter Hilber
a2c1fe7206 x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
Add a clocksource ID for TSC and a distinct one for the early TSC.

Use distinct IDs for TSC and early TSC, since those also have distinct
clocksource structs. This should help to keep existing semantics when
comparing clocksources.

Also, set the recently added struct system_counterval_t member cs_id to the
TSC ID in the cases where the clocksource member is being set to the TSC
clocksource. In the future, get_device_system_crosststamp() will compare
the clocksource ID in struct system_counterval_t, rather than the
clocksource.

For the x86 ART related code, system_counterval_t.cs == NULL corresponds to
system_counterval_t.cs_id == CSID_GENERIC (0).
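
The resulting ID space looks roughly like this (a sketch; only the IDs
relevant to this series are shown):

  enum clocksource_ids {
          CSID_GENERIC = 0,
          CSID_ARM_ARCH_COUNTER,
          CSID_X86_TSC_EARLY,
          CSID_X86_TSC,
          CSID_X86_KVM_CLK,
          CSID_MAX,
  };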

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-4-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Randy Dunlap
c55cbfcea6 x86/tsc: Correct kernel-doc notation
Add or modify function descriptions to remove kernel-doc warnings:

tsc.c:655: warning: missing initial short description on line:
 * native_calibrate_tsc
tsc.c:1339: warning: Excess function parameter 'cycles' description in 'convert_art_ns_to_tsc'
tsc.c:1339: warning: Excess function parameter 'cs' description in 'convert_art_ns_to_tsc'
tsc.c:1373: warning: Function parameter or member 'work' not described in 'tsc_refine_calibration_work'
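
The fixes are of this shape (a sketch, not the exact in-tree comment):

  /**
   * native_calibrate_tsc - determine the TSC frequency via CPUID
   *
   * Returns the TSC frequency in kHz, or 0 if it could not be determined.
   */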

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231221033620.32379-1-rdunlap@infradead.org
2024-02-07 17:05:21 +01:00
Chao Gao
d7f0a00e43 KVM: VMX: Report up-to-date exit qualification to userspace
Use vmx_get_exit_qual() to read the exit qualification.

vcpu->arch.exit_qualification is cached for EPT violation only and even
for EPT violation, it is stale at this point because the up-to-date
value is cached later in handle_ept_violation().
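
In essence (a sketch; the exit_qual variable is illustrative):

  /* before: possibly stale                          */
  exit_qual = vcpu->arch.exit_qualification;
  /* after: always read the live value from the VMCS */
  exit_qual = vmx_get_exit_qual(vcpu);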

Fixes: 70bcd708df ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 07:47:53 -08:00
Sean Christopherson
fdd58834d1 KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
Return -EINVAL instead of -EBUSY if userspace attempts KVM_SEV{,ES}_INIT
on a VM that already has SEV active.  Returning -EBUSY is nonsensical as
it's impossible to deactivate SEV without destroying the VM, i.e. the VM
isn't "busy" in any sane sense of the word, and the odds of any userspace
wanting exactly -EBUSY on a userspace bug are minuscule.
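
A sketch of the change:

  if (unlikely(sev->active))
          return -EINVAL;         /* was -EBUSY */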

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:10:12 -08:00
Ashish Kalra
0aa6b90ef9 KVM: SVM: Add support for allowing zero SEV ASIDs
Some BIOSes allow the end user to set the minimum SEV ASID value
(CPUID 0x8000001F_EDX) to be greater than the maximum number of
encrypted guests, or maximum SEV ASID value (CPUID 0x8000001F_ECX)
in order to dedicate all the SEV ASIDs to SEV-ES or SEV-SNP.

The SEV support, as coded, does not handle the case where the minimum
SEV ASID value can be greater than the maximum SEV ASID value.
As a result, the following confusing message is issued:

[   30.715724] kvm_amd: SEV enabled (ASIDs 1007 - 1006)

Fix the support to properly handle this case.
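
Conceptually, the guard is (a heavily simplified sketch; sev_es_only is
illustrative and the actual patch reworks the setup and reporting logic):

  /* the BIOS may leave no ASIDs for plain SEV guests */
  if (min_sev_asid > max_sev_asid)
          sev_es_only = true;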

Fixes: 916391a2d1 ("KVM: SVM: Add support for SEV-ES capability in KVM")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Cc: stable@vger.kernel.org
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240104190520.62510-1-Ashish.Kalra@amd.com
Link: https://lore.kernel.org/r/20240131235609.4161407-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:10:11 -08:00
Sean Christopherson
466eec4a22 KVM: SVM: Use unsigned integers when dealing with ASIDs
Convert all local ASID variables and parameters throughout the SEV code
from signed integers to unsigned integers, as ASIDs are fundamentally
unsigned values, and the global min/max variables are already unsigned
integers.

Functionally, this is a glorified nop as KVM guarantees min_sev_asid is
non-zero, and no CPU supports -1u as the _only_ asid, i.e. the signed vs.
unsigned goof won't cause problems in practice.

Opportunistically use sev_get_asid() in sev_flush_encrypted_page() instead
of open coding an equivalent.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:09:34 -08:00
Sean Christopherson
cc4ce37bed KVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return
Explicitly set sev->asid in sev_asid_new() when a new ASID is successfully
allocated, and return '0' to indicate success instead of overloading the
return value to multiplex the ASID with error codes.  There is exactly one
caller of sev_asid_new(), and sev_asid_free() already consumes sev->asid,
i.e. returning the ASID isn't necessary for flexibility, nor does it
provide symmetry between related APIs.
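
The caller shape after the change (a sketch):

  ret = sev_asid_new(sev);        /* sets sev->asid on success */
  if (ret)
          goto e_no_asid;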

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:08:44 -08:00
Xiaoyao Li
ccb2280ec2 x86/kvm: Use separate percpu variable to track the enabling of asyncpf
Refer to commit fd10cde929 ("KVM paravirt: Add async PF initialization
to PV guest") and commit 344d9588a9 ("KVM: Add PV MSR to enable
asynchronous page faults delivery"). It turns out that at the time when
asyncpf was introduced, the purpose was defining the shared PV data 'struct
kvm_vcpu_pv_apf_data' with the size of 64 bytes. However, it made a mistake
and defined the size to 68 bytes, which failed to make fit in a cache line
and made the code inconsistent with the documentation.

Below justification quoted from Sean[*]

  KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
  the documentation clearly states that enabling is based solely on the
  bit in the synthetic MSR.

  So rather than update the documentation, fix the goof by removing the
enabled field and using a separate percpu variable instead.
  KVM-as-a-host obviously doesn't enforce anything or consume the size,
  and changing the header will only affect guests that are rebuilt against
  the new header, so there's no chance of ABI breakage between KVM and its
  guests. The only possible breakage is if some other hypervisor is
  emulating KVM's async #PF (LOL) and relies on the guest to set
  kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor
  exists, (b) that would arguably be a violation of KVM's "spec", and
  (c) the worst case scenario is that the guest would simply lose async
  #PF functionality.

[*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u
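
The guest-side replacement is a plain percpu boolean (a sketch; the
in-tree variable name may differ):

  static DEFINE_PER_CPU_READ_MOSTLY(bool, async_pf_enabled);

  /* consulted instead of the old kvm_vcpu_pv_apf_data.enabled field */
  if (!__this_cpu_read(async_pf_enabled))
          return;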

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com
[sean: use true/false instead of 1/0 for booleans]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 10:58:56 -08:00
Ard Biesheuvel
1c811d403a x86/sev: Fix position dependent variable references in startup code
The early startup code executes from a 1:1 mapping of memory, which
differs from the mapping that the code was linked and/or relocated to
run at. The latter mapping is not active yet at this point, and so
symbol references that rely on it will fault.

Given that the core kernel is built without -fPIC, symbol references are
typically emitted as absolute, and so any such references occurring in
the early startup code will therefore crash the kernel.

While an attempt was made to work around this for the early SEV/SME
startup code, by forcing RIP-relative addressing for certain global
SEV/SME variables via inline assembly (see snp_cpuid_get_table() for
example), RIP-relative addressing must be pervasively enforced for
SEV/SME global variables when accessed prior to page table fixups.

__startup_64() already handles this issue for select non-SEV/SME global
variables using fixup_pointer(), which adjusts the pointer relative to a
`physaddr` argument. To avoid having to pass around this `physaddr`
argument across all functions needing to apply pointer fixups, introduce
a macro RIP_RELATIVE_REF() which generates a RIP-relative reference to
a given global variable. It is used where necessary to force
RIP-relative accesses to global variables.

For backporting purposes, this patch makes no attempt at cleaning up
other occurrences of this pattern, involving either inline asm or
fixup_pointer(). Those will be addressed later.

  [ bp: Call it "rip_rel_ref" everywhere like other code shortens
    "rIP-relative reference" and make the asm wrapper __always_inline. ]

Co-developed-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/all/20240130220845.1978329-1-kevinloughlin@google.com
2024-02-06 16:38:42 +01:00
Kees Cook
918327e9b7 ubsan: Remove CONFIG_UBSAN_SANITIZE_ALL
For simplicity in splitting out UBSan options into separate rules,
remove CONFIG_UBSAN_SANITIZE_ALL, effectively defaulting to "y", which
is how it is generally used anyway. (There are no ":= y" cases beyond
where a specific file is enabled when a top-level ":= n" is in effect.)

Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kbuild@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-06 02:21:38 -08:00
NOMURA JUNICHI(野村 淳一)
ac456ca0af x86/boot: Add a message about ignored early NMIs
Commit

  78a509fba9 ("x86/boot: Ignore NMIs during very early boot")

added an empty handler in early boot stage to avoid boot failure due to
spurious NMIs.

Add a diagnostic message to show that early NMIs have occurred.

  [ bp: Touchups. ]

  [ Committer note: tested by stopping the guest really early and
    injecting NMIs through qemu's monitor. Result:

    early console in setup code
    Spurious early NMIs ignored: 13
    ... ]

Suggested-by: Borislav Petkov <bp@alien8.de>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/lkml/20231130103339.GCZWhlA196uRklTMNF@fat_crate.local
2024-02-06 10:51:11 +01:00
H. Peter Anvin
9ba8ec8ee6 x86/boot: Add error_putdec() helper
Add a helper to print decimal numbers to early console.
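
A minimal sketch of such a helper (error_putstr() already exists in the
boot code; the conversion below is illustrative):

  static void error_putdec(unsigned long value)
  {
          char buf[21], *p = buf + sizeof(buf) - 1;

          *p = '\0';
          do {
                  *--p = '0' + (value % 10);
                  value /= 10;
          } while (value);

          error_putstr(p);
  }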

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/lkml/20240123112624.GBZa-iYP1l9SSYtr-V@fat_crate.local/
Link: https://lore.kernel.org/r/20240202035052.17963-1-junichi.nomura@nec.com
2024-02-06 10:50:21 +01:00
Nathan Chancellor
e459647710 x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORM
After commit a9ef277488 ("x86/kvm: Fix SEV check in
sev_map_percpu_data()"), there is a build error when building
x86_64_defconfig with GCOV using LLVM:

  ld.lld: error: undefined symbol: cc_vendor
  >>> referenced by kvm.c
  >>>               arch/x86/kernel/kvm.o:(kvm_smp_prepare_boot_cpu) in archive vmlinux.a

which corresponds to

  if (cc_vendor != CC_VENDOR_AMD ||
      !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
            return;

Without GCOV, clang is able to eliminate the use of cc_vendor because
cc_platform_has() evaluates to false when CONFIG_ARCH_HAS_CC_PLATFORM is
not set, meaning that if statement will be true no matter what value
cc_vendor has.

With GCOV, the instrumentation keeps the use of cc_vendor around for
code coverage purposes but cc_vendor is only declared, not defined,
without CONFIG_ARCH_HAS_CC_PLATFORM, leading to the build error above.

Provide a macro definition of cc_vendor when CONFIG_ARCH_HAS_CC_PLATFORM
is not set with a value of CC_VENDOR_NONE, so that the first condition
can always be evaluated/eliminated at compile time, avoiding the build
error altogether. This is very similar to the situation prior to
commit da86eb9611 ("x86/coco: Get rid of accessor functions").
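
The fix is of this shape (a sketch of the header change):

  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
  extern enum cc_vendor cc_vendor;
  #else
  #define cc_vendor (CC_VENDOR_NONE)
  #endif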

Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Message-Id: <20240202-provide-cc_vendor-without-arch_has_cc_platform-v1-1-09ad5f2a3099@kernel.org>
Fixes: a9ef277488 ("x86/kvm: Fix SEV check in sev_map_percpu_data()", 2024-01-31)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-06 03:56:04 -05:00
Mathias Krause
e1dda3afe2 KVM: x86: Fix broken debugregs ABI for 32 bit kernels
The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).

This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.

Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.
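
A sketch of the copy loop:

  unsigned int i;

  for (i = 0; i < ARRAY_SIZE(vcpu->arch.db); i++)
          dbgregs->db[i] = vcpu->arch.db[i];  /* __u64 <- unsigned long */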

There are likely no users (left) of 32-bit KVM; fix the bug nonetheless.

Fixes: a1efbe77c1 ("KVM: x86: Add support for saving&restoring debug registers")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240203124522.592778-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 15:40:54 -08:00
Mathias Krause
3376ca3f1a KVM: x86: Fix KVM_GET_MSRS stack info leak
Commit 6abe9c1386 ("KVM: X86: Move ignore_msrs handling upper the
stack") changed the 'ignore_msrs' handling, including sanitizing return
values to the caller. This was fine until commit 12bc2132b1 ("KVM:
X86: Do the same ignore_msrs check for feature msrs") which allowed
non-existing feature MSRs to be ignored, i.e. to not generate an error
on the ioctl() level. It even tried to preserve the sanitization of the
return value. However, the logic is flawed, as '*data' will be
overwritten again with the uninitialized stack value of msr.data.

Fix this by simplifying the logic and always initializing msr.data,
removing the need for an additional error exit path.
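
The essential shape of the fix (a sketch; helper names recalled from
kvm/x86.c, usage approximate):

  struct kvm_msr_entry msr = { .index = index };  /* .data is zeroed */
  int r = kvm_get_msr_feature(&msr);

  if (r == KVM_MSR_RET_INVALID && kvm_msr_ignored_check(index, 0, false))
          r = 0;

  *data = msr.data;   /* never the old stack garbage */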

Fixes: 12bc2132b1 ("KVM: X86: Do the same ignore_msrs check for feature msrs")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240203124522.592778-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 11:20:51 -08:00
Ard Biesheuvel
1ad55cecf2 x86/efistub: Use 1:1 file:memory mapping for PE/COFF .compat section
The .compat section is a dummy PE section that contains the address of
the 32-bit entrypoint of the 64-bit kernel image if it is bootable from
32-bit firmware (i.e., CONFIG_EFI_MIXED=y).

This section is only 8 bytes in size and is only referenced from the
loader, and so it is placed at the end of the memory view of the image,
to avoid the need for padding it to 4k, which is required for sections
appearing in the middle of the image.

Unfortunately, this violates the PE/COFF spec, and even if most EFI
loaders will work correctly (including the Tianocore reference
implementation), PE loaders do exist that reject such images, on the
basis that both the file and memory views of the file contents should be
described by the section headers in a monotonically increasing manner
without leaving any gaps.

So reorganize the sections to avoid this issue. This results in a slight
padding overhead (< 4k), which can be avoided if desired by disabling
CONFIG_EFI_MIXED (which is only needed in rare cases these days).

Fixes: 3e3eabe26d ("x86/boot: Increase section and file alignment to 4k/512")
Reported-by: Mike Beaton <mjsbeaton@gmail.com>
Link: https://lkml.kernel.org/r/CAHzAAWQ6srV6LVNdmfbJhOwhBw5ZzxxZZ07aHt9oKkfYAdvuQQ%40mail.gmail.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-02-05 10:24:51 +00:00
Ricardo B. Marliere
a6a789165b x86/mce: Make mce_subsys const
Now that the driver core can properly handle constant struct bus_type,
make mce_subsys a constant structure.
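
I.e. (a sketch):

  static const struct bus_type mce_subsys = {
          .name           = "machinecheck",
          .dev_name       = "machinecheck",
  };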

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-x86-v1-1-4e7171be88e8@marliere.net
2024-02-05 10:26:51 +01:00
Borislav Petkov (AMD)
2995674833 x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
It was meant well at the time but nothing's using it so get rid of it.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240202163510.GDZb0Zvj8qOndvFOiZ@fat_crate.local
2024-02-03 11:38:17 +01:00
Mingwei Zhang
05519c86d6 KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can lead to incorrect
counts and spurious PMIs in the guest.
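
The snapshot, sketched (fixed counter 2's enable bits live at bits 8-11
of the control MSR, beyond what a u8 can hold):

  u64 old_fixed_ctr_ctrl = pmu->fixed_ctr_ctrl;   /* was: u8 */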

Fixes: 76d287b234 ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: stable@vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-02 14:07:27 -08:00
Xin Li (Intel)
cba9ff3345 x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
Change array_index_mask_nospec() to __always_inline because plain
"inline" does not guarantee inlining, as explained in
https://www.kernel.org/doc/local/inline.html.
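
I.e., roughly (sketched from the x86 implementation):

  static __always_inline unsigned long
  array_index_mask_nospec(unsigned long index, unsigned long size)
  {
          unsigned long mask;

          asm volatile ("cmp %1,%2; sbb %0,%0;"
                        : "=r" (mask)
                        : "g" (size), "r" (index)
                        : "cc");
          return mask;
  }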

Fixes: 6786137bf8fd ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240202090225.322544-1-xin@zytor.com
2024-02-02 10:05:55 +01:00
Jakub Kicinski
cf244463a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01 15:12:37 -08:00
Linus Torvalds
a412682659 Kbuild fixes for v6.8
- Fix UML build with clang-18 and newer
 
  - Avoid using the alias attribute in host programs
 
  - Replace tabs with spaces when followed by conditionals for
    future GNU Make versions
 
  - Fix rpm-pkg for the systemd-provided kernel-install tool
 
  - Fix the undefined behavior in Kconfig for a 'int' symbol used in a
    conditional
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmW7nmkVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGZvAP/3E1+nGzo7EQNyew+pJiY+Tq4qxN
 NV/O/XM1aupQICq4tm5oyp04FFg87z3RYs3IEEqg0Eqi/3o/8udLDj3f4tPignz5
 G+C4IMYel+mrcSUvZYEDy7avDwEJwdsh28iv4wJb660gyUyRPEd7sQa1SKA3P4nq
 6g2+aDegRGXLZkdz47KjnlIsx4gF+ZYX/n6gZe7xSGQWrmgWP/qhuEkog7YfLIMe
 uIXFD1f0gP0dMYSjiuXFLf+4JTUYi6cHPkAgprv7HAReUoceie99KcNgRkqBTL+I
 MKAt+GxEVL36FKeFKzobjgUrzX2wruY5o9egxGG7W+xYrM4n/oA2rExf94gR/Qyj
 1jGT1vM6aTO51JxhINEX0ZBD0E+oaO6H0z26seOMDMcKZlw2dkwNmUCyPu9O9DH3
 bMv1qVZvjBVU0Jn9IIQ+m0nXCmns3W84lJEvFMUkW2TMVoYKwjOaU+7XK8DVKJ5T
 Lr6FxCzk2CCYiL8VOO53YBG6csPrsRqXriP3RvmaZTW7B/6qPqkCAS0yyKILg/Os
 83vBB0vOaLXXor+DIk2E0H0fa/wFlc3VrBe07lFkGQefG1/PpchFU7B44DklDUqo
 f9zHPnTwrdGpV1hfnGmUS2aDISbgPKeXgcQgZeNLUDQtj6BM+UPjN+0jmH18RL5i
 OvLACtAJyrcssLAr
 =Rn0I
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix UML build with clang-18 and newer

 - Avoid using the alias attribute in host programs

 - Replace tabs with spaces when followed by conditionals for future GNU
   Make versions

 - Fix rpm-pkg for the systemd-provided kernel-install tool

 - Fix the undefined behavior in Kconfig for a 'int' symbol used in a
   conditional

* tag 'kbuild-fixes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: initialize sym->curr.tri to 'no' for all symbol types again
  kbuild: rpm-pkg: simplify installkernel %post
  kbuild: Replace tabs with spaces when followed by conditionals
  modpost: avoid using the alias attribute
  kbuild: fix W= flags in the help message
  modpost: Add '.ltext' and '.ltext.*' to TEXT_SECTIONS
  um: Fix adding '-no-pie' for clang
  kbuild: defconf: use SRCARCH to find merged configs
2024-02-01 11:57:42 -08:00
Tanzir Hasan
66a5c40f60 kernel.h: removed REPEAT_BYTE from kernel.h
This patch creates wordpart.h and includes it in asm/word-at-a-time.h
for all architectures. WORD_AT_A_TIME_CONSTANTS depends on kernel.h
because of REPEAT_BYTE. Moving this to another header and including it
where necessary allows us to not include the bloated kernel.h. Making
this implicit dependency on REPEAT_BYTE explicit allows for later
improvements in the lib/string.c inclusion list.
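
The macro itself is tiny (a sketch of the moved definition):

  /* include/linux/wordpart.h */
  #define REPEAT_BYTE(x)  ((~0ul / 0xff) * (x))

  /* e.g. REPEAT_BYTE(0x41) == 0x4141414141414141 on 64-bit */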

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Tanzir Hasan <tanzirh@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20231226-libstringheader-v6-1-80aa08c7652c@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-01 09:47:59 -08:00
Sean Christopherson
83bdfe04c9 KVM: x86/pmu: Avoid CPL lookup if PMC enabling for USER and KERNEL is the same
Don't bother querying the CPL if a PMC is (not) counting for both USER and
KERNEL, i.e. if the end result is guaranteed to be the same regardless of
the CPL.  Querying the CPL on Intel requires a VMREAD, i.e. isn't free,
and a single CMP+Jcc is cheap.
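
A sketch of the short-circuit:

  if (select_os == select_user)
          return select_os;       /* CPL is irrelevant */

  return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os
                                                        : select_user;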

Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
e35529fb4a KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired
When triggering events, i.e. emulating PMC events in software, check for
a matching event selector before checking the event is allowed.  The "is
allowed" check *might* be cheap, but it could also be very costly, e.g. if
userspace has defined a large PMU event filter.  The event selector check
on the other hand is all but guaranteed to be <10 uops, e.g. looks
something like:

   0xffffffff8105e615 <+5>:	movabs $0xf0000ffff,%rax
   0xffffffff8105e61f <+15>:	xor    %rdi,%rsi
   0xffffffff8105e622 <+18>:	test   %rax,%rsi
   0xffffffff8105e625 <+21>:	sete   %al

Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
afda2d7666 KVM: x86/pmu: Expand the comment about which bits are checked when emulating events
Expand the comment about what bits are and aren't checked when emulating
PMC events in software.  As pointed out by Jim, AMD's mask includes bits
35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33)
as well as reserved bits (34 and 35).

Checking the IN_TX* bits is actually correct, as it's safe to assert that
the vCPU can't be in an HLE/RTM transaction if KVM is emulating an
instruction, i.e. KVM *shouldn't* count if either of those bits is set.

For the reserved bits, KVM has equal odds of being right if Intel adds
new behavior, i.e. ignoring them is just as likely to be correct as
checking them.

Opportunistically explain *why* the other flags aren't checked.

Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00