linux

iv/linux

Author	SHA1	Message	Date
Nikita Zhandarovich	8d26a4fecc	x86/mm: Fix use of uninitialized buffer in sme_enable() commit cbebd68f59f03633469f3ecf9bea99cd6cce3854 upstream. cmdline_find_option() may fail before doing any initialization of the buffer array. This may lead to unpredictable results when the same buffer is used later in calls to strncmp() function. Fix the issue by returning early if cmdline_find_option() returns an error. Found by Linux Verification Center (linuxtesting.org) with static analysis tool SVACE. Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230306160656.14844-1-n.zhandarovich@fintech.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-22 13:28:09 +01:00
Paolo Bonzini	65e4c9a6d0	KVM: nVMX: add missing consistency checks for CR0 and CR4 commit 112e66017bff7f2837030f34c2bc19501e9212d5 upstream. The effective values of the guest CR0 and CR4 registers may differ from those included in the VMCS12. In particular, disabling EPT forces CR4.PAE=1 and disabling unrestricted guest mode forces CR0.PG=CR0.PE=1. Therefore, checks on these bits cannot be delegated to the processor and must be performed by KVM. Reported-by: Reima ISHII <ishiir@g.ecc.u-tokyo.ac.jp> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-22 13:28:09 +01:00
H.J. Lu	a4bd6d4df3	x86, vmlinux.lds: Add RUNTIME_DISCARD_EXIT to generic DISCARDS commit 84d5f77fc2ee4e010c2c037750e32f06e55224b0 upstream. In the x86 kernel, .exit.text and .exit.data sections are discarded at runtime, not by the linker. Add RUNTIME_DISCARD_EXIT to generic DISCARDS and define it in the x86 kernel linker script to keep them. The sections are added before the DISCARD directive so document here only the situation explicitly as this change doesn't have any effect on the generated kernel. Also, other architectures like ARM64 will use it too so generalize the approach with the RUNTIME_DISCARD_EXIT define. [ bp: Massage and extend commit message. ] Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200326193021.255002-1-hjl.tools@gmail.com Signed-off-by: Tom Saeger <tom.saeger@oracle.com>	2023-03-17 08:32:53 +01:00
Andrew Cooper	e40c1e9da1	x86/CPU/AMD: Disable XSAVES on AMD family 0x17 commit b0563468eeac88ebc70559d52a0b66efc37e4e9d upstream. AMD Erratum 1386 is summarised as: XSAVES Instruction May Fail to Save XMM Registers to the Provided State Save Area This piece of accidental chronomancy causes the %xmm registers to occasionally reset back to an older value. Ignore the XSAVES feature on all AMD Zen1/2 hardware. The XSAVEC instruction (which works fine) is equivalent on affected parts. [ bp: Typos, move it into the F17h-specific function. ] Reported-by: Tavis Ormandy <taviso@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230307174643.1240184-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-17 08:32:47 +01:00
Linus Torvalds	27c64d90d9	x86/resctl: fix scheduler confusion with 'current' commit 7fef099702527c3b2c5234a2ea6a24411485a13a upstream. The implementation of 'current' on x86 is very intentionally special: it is a very common thing to look up, and it uses 'this_cpu_read_stable()' to get the current thread pointer efficiently from per-cpu storage. And the keyword in there is 'stable': the current thread pointer never changes as far as a single thread is concerned. Even if when a thread is preempted, or moved to another CPU, or even across an explicit call 'schedule()' that thread will still have the same value for 'current'. It is, after all, the kernel base pointer to thread-local storage. That's why it's stable to begin with, but it's also why it's important enough that we have that special 'this_cpu_read_stable()' access for it. So this is all done very intentionally to allow the compiler to treat 'current' as a value that never visibly changes, so that the compiler can do CSE and combine multiple different 'current' accesses into one. However, there is obviously one very special situation when the currently running thread does actually change: inside the scheduler itself. So the scheduler code paths are special, and do not have a 'current' thread at all. Instead there are _two_ threads: the previous and the next thread - typically called 'prev' and 'next' (or prev_p/next_p) internally. So this is all actually quite straightforward and simple, and not all that complicated. Except for when you then have special code that is run in scheduler context, that code then has to be aware that 'current' isn't really a valid thing. Did you mean 'prev'? Did you mean 'next'? In fact, even if then look at the code, and you use 'current' after the new value has been assigned to the percpu variable, we have explicitly told the compiler that 'current' is magical and always stable. So the compiler is quite free to use an older (or newer) value of 'current', and the actual assignment to the percpu storage is not relevant even if it might look that way. Which is exactly what happened in the resctl code, that blithely used 'current' in '__resctrl_sched_in()' when it really wanted the new process state (as implied by the name: we're scheduling 'into' that new resctl state). And clang would end up just using the old thread pointer value at least in some configurations. This could have happened with gcc too, and purely depends on random compiler details. Clang just seems to have been more aggressive about moving the read of the per-cpu current_task pointer around. The fix is trivial: just make the resctl code adhere to the scheduler rules of using the prev/next thread pointer explicitly, instead of using 'current' in a situation where it just wasn't valid. That same code is then also used outside of the scheduler context (when a thread resctl state is explicitly changed), and then we will just pass in 'current' as that pointer, of course. There is no ambiguity in that case. The fix may be trivial, but noticing and figuring out what went wrong was not. The credit for that goes to Stephane Eranian. Reported-by: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/ Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Stephane Eranian <eranian@google.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:16 +01:00
Valentin Schneider	81da72aaf5	x86/resctrl: Apply READ_ONCE/WRITE_ONCE to task_struct.{rmid,closid} commit 6d3b47ddffed70006cf4ba360eef61e9ce097d8f upstream. A CPU's current task can have its {closid, rmid} fields read locally while they are being concurrently written to from another CPU. This can happen anytime __resctrl_sched_in() races with either __rdtgroup_move_task() or rdt_move_group_tasks(). Prevent load / store tearing for those accesses by giving them the READ_ONCE() / WRITE_ONCE() treatment. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/9921fda88ad81afb9885b517fbe864a2bc7c35a9.1608243147.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:16 +01:00
Darrell Kavanagh	fde59e273b	firmware/efi sysfb_efi: Add quirk for Lenovo IdeaPad Duet 3 [ Upstream commit e1d447157f232c650e6f32c9fb89ff3d0207c69a ] Another Lenovo convertable which reports a landscape resolution of 1920x1200 with a pitch of (1920 * 4) bytes, while the actual framebuffer has a resolution of 1200x1920 with a pitch of (1200 * 4) bytes. Signed-off-by: Darrell Kavanagh <darrell.kavanagh@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-03-11 16:44:13 +01:00
Ammar Faizi	a8816afcaf	x86: um: vdso: Add '%rcx' and '%r11' to the syscall clobber list [ Upstream commit 5541992e512de8c9133110809f767bd1b54ee10d ] The 'syscall' instruction clobbers '%rcx' and '%r11', but they are not listed in the inline Assembly that performs the syscall instruction. No real bug is found. It wasn't buggy by luck because '%rcx' and '%r11' are caller-saved registers, and not used in the functions, and the functions are never inlined. Add them to the clobber list for code correctness. Fixes: f1c2bb8b9964ed31de988910f8b1cfb586d30091 ("um: implement a x86_64 vDSO") Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-03-11 16:44:10 +01:00
KP Singh	34c1b60e7a	x86/speculation: Allow enabling STIBP with legacy IBRS commit 6921ed9049bc7457f66c1596c5b78aec0dae4a9d upstream. When plain IBRS is enabled (not enhanced IBRS), the logic in spectre_v2_user_select_mitigation() determines that STIBP is not needed. The IBRS bit implicitly protects against cross-thread branch target injection. However, with legacy IBRS, the IBRS bit is cleared on returning to userspace for performance reasons which leaves userspace threads vulnerable to cross-thread branch target injection against which STIBP protects. Exclude IBRS from the spectre_v2_in_ibrs_mode() check to allow for enabling STIBP (through seccomp/prctl() by default or always-on, if selected by spectre_v2_user kernel cmdline parameter). [ bp: Massage. ] Fixes: 7c693f54c873 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS") Reported-by: José Oliveira <joseloliveira11@gmail.com> Reported-by: Rodrigo Branco <rodrigo@kernelhacking.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230220120127.1975241-1-kpsingh@kernel.org Link: https://lore.kernel.org/r/20230221184908.2349578-1-kpsingh@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:02 +01:00
Borislav Petkov (AMD)	979e197968	x86/microcode/AMD: Fix mixed steppings support commit 7ff6edf4fef38ab404ee7861f257e28eaaeed35f upstream. The AMD side of the loader has always claimed to support mixed steppings. But somewhere along the way, it broke that by assuming that the cached patch blob is a single one instead of it being one per node. So turn it into a per-node one so that each node can stash the blob relevant for it. [ NB: Fixes tag is not really the exactly correct one but it is good enough. ] Fixes: fe055896c040 ("x86/microcode: Merge the early microcode loader") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> # 2355370cd941 ("x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter") Cc: <stable@kernel.org> # a5ad92134bd1 ("x86/microcode/AMD: Add a @cpu parameter to the reloading functions") Link: https://lore.kernel.org/r/20230130161709.11615-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:02 +01:00
Borislav Petkov (AMD)	727bc2c285	x86/microcode/AMD: Add a @cpu parameter to the reloading functions commit a5ad92134bd153a9ccdcddf09a95b088f36c3cce upstream. Will be used in a subsequent change. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-3-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:02 +01:00
Borislav Petkov (AMD)	4c26edf2ea	x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter commit 2355370cd941cbb20882cc3f34460f9f2b8f9a18 upstream. It is always the BSP. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-2-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:02 +01:00
Yang Jihong	a0415b79dd	x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range commit f1c97a1b4ef709e3f066f82e3ba3108c3b133ae6 upstream. When arch_prepare_optimized_kprobe calculating jump destination address, it copies original instructions from jmp-optimized kprobe (see __recover_optprobed_insn), and calculated based on length of original instruction. arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when checking whether jmp-optimized kprobe exists. As a result, setup_detour_execution may jump to a range that has been overwritten by jump destination address, resulting in an inval opcode error. For example, assume that register two kprobes whose addresses are <func+9> and <func+11> in "func" function. The original code of "func" function is as follows: 0xffffffff816cb5e9 <+9>: push %r12 0xffffffff816cb5eb <+11>: xor %r12d,%r12d 0xffffffff816cb5ee <+14>: test %rdi,%rdi 0xffffffff816cb5f1 <+17>: setne %r12b 0xffffffff816cb5f5 <+21>: push %rbp 1.Register the kprobe for <func+11>, assume that is kp1, corresponding optimized_kprobe is op1. After the optimization, "func" code changes to: 0xffffffff816cc079 <+9>: push %r12 0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000 0xffffffff816cc080 <+16>: incl 0xf(%rcx) 0xffffffff816cc083 <+19>: xchg %eax,%ebp 0xffffffff816cc084 <+20>: (bad) 0xffffffff816cc085 <+21>: push %rbp Now op1->flags == KPROBE_FLAG_OPTIMATED; 2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2. register_kprobe(kp2) register_aggr_kprobe alloc_aggr_kprobe __prepare_optimized_kprobe arch_prepare_optimized_kprobe __recover_optprobed_insn // copy original bytes from kp1->optinsn.copied_insn, // jump address = <func+14> 3. disable kp1: disable_kprobe(kp1) __disable_kprobe ... if (p == orig_p \|\| aggr_kprobe_disabled(orig_p)) { ret = disarm_kprobe(orig_p, true) // add op1 in unoptimizing_list, not unoptimized orig_p->flags \|= KPROBE_FLAG_DISABLED; // op1->flags == KPROBE_FLAG_OPTIMATED \| KPROBE_FLAG_DISABLED ... 4. unregister kp2 __unregister_kprobe_top ... if (!kprobe_disabled(ap) && !kprobes_all_disarmed) { optimize_kprobe(op) ... if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return return; p->kp.flags \|= KPROBE_FLAG_OPTIMIZED; // now op2 has KPROBE_FLAG_OPTIMIZED } "func" code now is: 0xffffffff816cc079 <+9>: int3 0xffffffff816cc07a <+10>: push %rsp 0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000 0xffffffff816cc080 <+16>: incl 0xf(%rcx) 0xffffffff816cc083 <+19>: xchg %eax,%ebp 0xffffffff816cc084 <+20>: (bad) 0xffffffff816cc085 <+21>: push %rbp 5. if call "func", int3 handler call setup_detour_execution: if (p->flags & KPROBE_FLAG_OPTIMIZED) { ... regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX; ... } The code for the destination address is 0xffffffffa021072c: push %r12 0xffffffffa021072e: xor %r12d,%r12d 0xffffffffa0210731: jmp 0xffffffff816cb5ee <func+14> However, <func+14> is not a valid start instruction address. As a result, an error occurs. Link: https://lore.kernel.org/all/20230216034247.32348-3-yangjihong1@huawei.com/ Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:02 +01:00
Yang Jihong	ec206a38d3	x86/kprobes: Fix __recover_optprobed_insn check optimizing logic commit 868a6fc0ca2407622d2833adefe1c4d284766c4c upstream. Since the following commit: commit f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") modified the update timing of the KPROBE_FLAG_OPTIMIZED, a optimized_kprobe may be in the optimizing or unoptimizing state when op.kp->flags has KPROBE_FLAG_OPTIMIZED and op->list is not empty. The __recover_optprobed_insn check logic is incorrect, a kprobe in the unoptimizing state may be incorrectly determined as unoptimizing. As a result, incorrect instructions are copied. The optprobe_queued_unopt function needs to be exported for invoking in arch directory. Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/ Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") Cc: stable@vger.kernel.org Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:01 +01:00
Sean Christopherson	e4ce333cc6	x86/reboot: Disable SVM, not just VMX, when stopping CPUs commit a2b07fa7b93321c059af0c6d492cc9a4f1e390aa upstream. Disable SVM and more importantly force GIF=1 when halting a CPU or rebooting the machine. Similar to VMX, SVM allows software to block INITs via CLGI, and thus can be problematic for a crash/reboot. The window for failure is smaller with SVM as INIT is only blocked while GIF=0, i.e. between CLGI and STGI, but the window does exist. Fixes: fba4f472b33a ("x86/reboot: Turn off KVM when halting a CPU") Cc: stable@vger.kernel.org Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:01 +01:00
Sean Christopherson	37459195d9	x86/reboot: Disable virtualization in an emergency if SVM is supported commit d81f952aa657b76cea381384bef1fea35c5fd266 upstream. Disable SVM on all CPUs via NMI shootdown during an emergency reboot. Like VMX, SVM can block INIT, e.g. if the emergency reboot is triggered between CLGI and STGI, and thus can prevent bringing up other CPUs via INIT-SIPI-SIPI. Cc: stable@vger.kernel.org Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:01 +01:00
Sean Christopherson	87459b9fce	x86/crash: Disable virt in core NMI crash handler to avoid double shootdown commit 26044aff37a5455b19a91785086914fd33053ef4 upstream. Disable virtualization in crash_nmi_callback() and rework the emergency_vmx_disable_all() path to do an NMI shootdown if and only if a shootdown has not already occurred. NMI crash shootdown fundamentally can't support multiple invocations as responding CPUs are deliberately put into halt state without unblocking NMIs. But, the emergency reboot path doesn't have any work of its own, it simply cares about disabling virtualization, i.e. so long as a shootdown occurred, emergency reboot doesn't care who initiated the shootdown, or when. If "crash_kexec_post_notifiers" is specified on the kernel command line, panic() will invoke crash_smp_send_stop() and result in a second call to nmi_shootdown_cpus() during native_machine_emergency_restart(). Invoke the callback _before_ disabling virtualization, as the current VMCS needs to be cleared before doing VMXOFF. Note, this results in a subtle change in ordering between disabling virtualization and stopping Intel PT on the responding CPUs. While VMX and Intel PT do interact, VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one another, which is all that matters when panicking. Harden nmi_shootdown_cpus() against multiple invocations to try and capture any such kernel bugs via a WARN instead of hanging the system during a crash/dump, e.g. prior to the recent hardening of register_nmi_handler(), re-registering the NMI handler would trigger a double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y. list_add double add: new=ffffffff82220800, prev=ffffffff8221cfe8, next=ffffffff82220800. WARNING: CPU: 2 PID: 1319 at lib/list_debug.c:29 __list_add_valid+0x67/0x70 Call Trace: __register_nmi_handler+0xcf/0x130 nmi_shootdown_cpus+0x39/0x90 native_machine_emergency_restart+0x1c9/0x1d0 panic+0x237/0x29b Extract the disabling logic to a common helper to deduplicate code, and to prepare for doing the shootdown in the emergency reboot path if SVM is supported. Note, prior to commit ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected against a second invocation by a cpu_vmx_enabled() check as the kdump handler would disable VMX if it ran first. Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported") Cc: stable@vger.kernel.org Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:01 +01:00
Sean Christopherson	ee80fb1dca	x86/virt: Force GIF=1 prior to disabling SVM (for reboot flows) commit 6a3236580b0b1accc3976345e723104f74f6f8e6 upstream. Set GIF=1 prior to disabling SVM to ensure that INIT is recognized if the kernel is disabling SVM in an emergency, e.g. if the kernel is about to jump into a crash kernel or may reboot without doing a full CPU RESET. If GIF is left cleared, the new kernel (or firmware) will be unabled to awaken APs. Eat faults on STGI (due to EFER.SVME=0) as it's possible that SVM could be disabled via NMI shootdown between reading EFER.SVME and executing STGI. Link: https://lore.kernel.org/all/cbcb6f35-e5d7-c1c9-4db9-fe5cc4de579a@amd.com Cc: stable@vger.kernel.org Cc: Andrew Cooper <Andrew.Cooper3@citrix.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-11 16:44:01 +01:00
Breno Leitao	f3a324362b	x86/bugs: Reset speculation control settings on init [ Upstream commit 0125acda7d76b943ca55811df40ed6ec0ecf670f ] Currently, x86_spec_ctrl_base is read at boot time and speculative bits are set if Kconfig items are enabled. For example, IBRS is enabled if CONFIG_CPU_IBRS_ENTRY is configured, etc. These MSR bits are not cleared if the mitigations are disabled. This is a problem when kexec-ing a kernel that has the mitigation disabled from a kernel that has the mitigation enabled. In this case, the MSR bits are not cleared during the new kernel boot. As a result, this might have some performance degradation that is hard to pinpoint. This problem does not happen if the machine is (hard) rebooted because the bit will be cleared by default. [ bp: Massage. ] Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221128153148.1129350-1-leitao@debian.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-03-11 16:43:54 +01:00
Eric Biggers	c300697690	crypto: x86/ghash - fix unaligned access in ghash_setkey() [ Upstream commit 116db2704c193fff6d73ea6c2219625f0c9bdfc8 ] The key can be unaligned, so use the unaligned memory access helpers. Fixes: 8ceee72808d1 ("crypto: ghash-clmulni-intel - use C implementation for setkey()") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-03-11 16:43:38 +01:00
Jim Mattson	f93a1a5bdc	KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS [ Upstream commit 2e7eab81425ad6c875f2ed47c0ce01e78afc38a5 ] According to Intel's document on Indirect Branch Restricted Speculation, "Enabling IBRS does not prevent software from controlling the predicted targets of indirect branches of unrelated software executed later at the same predictor mode (for example, between two different user applications, or two different virtual machines). Such isolation can be ensured through use of the Indirect Branch Predictor Barrier (IBPB) command." This applies to both basic and enhanced IBRS. Since L1 and L2 VMs share hardware predictor modes (guest-user and guest-kernel), hardware IBRS is not sufficient to virtualize IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts, hardware IBRS is actually sufficient in practice, even though it isn't sufficient architecturally.) For virtual CPUs that support IBRS, add an indirect branch prediction barrier on emulated VM-exit, to ensure that the predicted targets of indirect branches executed in L1 cannot be controlled by software that was executed in L2. Since we typically don't intercept guest writes to IA32_SPEC_CTRL, perform the IBPB at emulated VM-exit regardless of the current IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is clear at emulated VM-exit. This is CVE-2022-2196. Fixes: 5c911beff20a ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02") Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-02-25 11:53:26 +01:00
Sean Christopherson	db209f39f1	KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception [ Upstream commit 17122c06b86c9f77f45b86b8e62c3ed440847a59 ] Treat any exception during instruction decode for EMULTYPE_SKIP as a "full" emulation failure, i.e. signal failure instead of queuing the exception. When decoding purely to skip an instruction, KVM and/or the CPU has already done some amount of emulation that cannot be unwound, e.g. on an EPT misconfig VM-Exit KVM has already processeed the emulated MMIO. KVM already does this if a #UD is encountered, but not for other exceptions, e.g. if a #PF is encountered during fetch. In SVM's soft-injection use case, queueing the exception is particularly problematic as queueing exceptions while injecting events can put KVM into an infinite loop due to bailing from VM-Enter to service the newly pending exception. E.g. multiple warnings to detect such behavior fire: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-02-25 11:53:26 +01:00
Greg Kroah-Hartman	9f95a161a7	kvm: initialize all of the kvm_debugregs structure before sending it to userspace commit 2c10b61421a28e95a46ab489fd56c0f442ff6952 upstream. When calling the KVM_GET_DEBUGREGS ioctl, on some configurations, there might be some unitialized portions of the kvm_debugregs structure that could be copied to userspace. Prevent this as is done in the other kvm ioctls, by setting the whole structure to 0 before copying anything into it. Bonus is that this reduces the lines of code as the explicit flag setting and reserved space zeroing out can be removed. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable <stable@kernel.org> Reported-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-22 12:50:41 +01:00
Eric W. Biederman	9a18c9c833	exit: Add and use make_task_dead. commit 0e25498f8cd43c1b5aa327f373dd094e9a006da7 upstream. There are two big uses of do_exit. The first is it's design use to be the guts of the exit(2) system call. The second use is to terminate a task after something catastrophic has happened like a NULL pointer in kernel code. Add a function make_task_dead that is initialy exactly the same as do_exit to cover the cases where do_exit is called to handle catastrophic failure. In time this can probably be reduced to just a light wrapper around do_task_dead. For now keep it exactly the same so that there will be no behavioral differences introducing this new concept. Replace all of the uses of do_exit that use it for catastraphic task cleanup with make_task_dead to make it clear what the code is doing. As part of this rename rewind_stack_do_exit rewind_stack_and_make_dead. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-02-06 07:52:49 +01:00
Mikulas Patocka	8667280a67	x86/asm: Fix an assembler warning with current binutils commit 55d235361fccef573990dfa5724ab453866e7816 upstream. Fix a warning: "found `movsd'; assuming `movsl' was meant" Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-06 07:52:48 +01:00
Colin Ian King	fbf7b0e4ce	perf/x86/amd: fix potential integer overflow on shift of a int commit 08245672cdc6505550d1a5020603b0a8d4a6dcc7 upstream. The left shift of int 32 bit integer constant 1 is evaluated using 32 bit arithmetic and then passed as a 64 bit function argument. In the case where i is 32 or more this can lead to an overflow. Avoid this by shifting using the BIT_ULL macro instead. Fixes: 471af006a747 ("perf/x86/amd: Constrain Large Increment per Cycle events") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ian Rogers <irogers@google.com> Acked-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20221202135149.1797974-1-colin.i.king@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-06 07:52:47 +01:00
Thomas Gleixner	8770cd9d7c	x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL commit 5fa55950729d0762a787451dc52862c3f850f859 upstream. Baoquan reported that after triggering a crash the subsequent crash-kernel fails to boot about half of the time. It triggers a NULL pointer dereference in the periodic tick code. This happens because the legacy timer interrupt (IRQ0) is resent in software which happens in soft interrupt (tasklet) context. In this context get_irq_regs() returns NULL which leads to the NULL pointer dereference. The reason for the resend is a spurious APIC interrupt on the IRQ0 vector which is captured and leads to a resend when the legacy timer interrupt is enabled. This is wrong because the legacy PIC interrupts are level triggered and therefore should never be resent in software, but nothing ever sets the IRQ_LEVEL flag on those interrupts, so the core code does not know about their trigger type. Ensure that IRQ_LEVEL is set when the legacy PCI interrupts are set up. Fixes: a4633adcdbc1 ("[PATCH] genirq: add genirq sw IRQ-retrigger") Reported-by: Baoquan He <bhe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Baoquan He <bhe@redhat.com> Link: https://lore.kernel.org/r/87mt6rjrra.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-06 07:52:47 +01:00
Hendrik Borghorst	aa33d334bd	KVM: x86/vmx: Do not skip segment attributes if unusable bit is set commit a44b331614e6f7e63902ed7dff7adc8c85edd8bc upstream. When serializing and deserializing kvm_sregs, attributes of the segment descriptors are stored by user space. For unusable segments, vmx_segment_access_rights skips all attributes and sets them to 0. This means we zero out the DPL (Descriptor Privilege Level) for unusable entries. Unusable segments are - contrary to their name - usable in 64bit mode and are used by guests to for example create a linear map through the NULL selector. VMENTER checks if SS.DPL is correct depending on the CS segment type. For types 9 (Execute Only) and 11 (Execute Read), CS.DPL must be equal to SS.DPL [1]. We have seen real world guests setting CS to a usable segment with DPL=3 and SS to an unusable segment with DPL=3. Once we go through an sregs get/set cycle, SS.DPL turns to 0. This causes the virtual machine to crash reproducibly. This commit changes the attribute logic to always preserve attributes for unusable segments. According to [2] SS.DPL is always saved on VM exits, regardless of the unusable bit so user space applications should have saved the information on serialization correctly. [3] specifies that besides SS.DPL the rest of the attributes of the descriptors are undefined after VM entry if unusable bit is set. So, there should be no harm in setting them all to the previous state. [1] Intel SDM Vol 3C 26.3.1.2 Checks on Guest Segment Registers [2] Intel SDM Vol 3C 27.3.2 Saving Segment Registers and Descriptor-Table Registers [3] Intel SDM Vol 3C 26.3.2.2 Loading Guest Segment Registers and Descriptor-Table Registers Cc: Alexander Graf <graf@amazon.de> Cc: stable@vger.kernel.org Signed-off-by: Hendrik Borghorst <hborghor@amazon.de> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Alexander Graf <graf@amazon.com> Message-Id: <20221114164823.69555-1-hborghor@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-06 07:52:44 +01:00
YingChi Long	7242fc8c2f	x86/fpu: Use _Alignof to avoid undefined behavior in TYPE_ALIGN commit 55228db2697c09abddcb9487c3d9fa5854a932cd upstream. WG14 N2350 specifies that it is an undefined behavior to have type definitions within offsetof", see https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2350.htm This specification is also part of C23. Therefore, replace the TYPE_ALIGN macro with the _Alignof builtin to avoid undefined behavior. (_Alignof itself is C11 and the kernel is built with -gnu11). ISO C11 _Alignof is subtly different from the GNU C extension __alignof__. Latter is the preferred alignment and _Alignof the minimal alignment. For long long on x86 these are 8 and 4 respectively. The macro TYPE_ALIGN's behavior matches _Alignof rather than __alignof__. [ bp: Massage commit message. ] Signed-off-by: YingChi Long <me@inclyc.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20220925153151.2467884-1-me@inclyc.cn Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:18:01 +01:00
Peter Newman	3030f11f27	x86/resctrl: Fix task CLOSID/RMID update race [ Upstream commit fe1f0714385fbcf76b0cbceb02b7277d842014fc ] When the user moves a running task to a new rdtgroup using the task's file interface or by deleting its rdtgroup, the resulting change in CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the task(s) CPUs. x86 allows reordering loads with prior stores, so if the task starts running between a task_curr() check that the CPU hoisted before the stores in the CLOSID/RMID update then it can start running with the old CLOSID/RMID until it is switched again because __rdtgroup_move_task() failed to determine that it needs to be interrupted to obtain the new CLOSID/RMID. Refer to the diagram below: CPU 0 CPU 1 ----- ----- __rdtgroup_move_task(): curr <- t1->cpu->rq->curr __schedule(): rq->curr <- t1 resctrl_sched_in(): t1->{closid,rmid} -> {1,1} t1->{closid,rmid} <- {2,2} if (curr == t1) // false IPI(t1->cpu) A similar race impacts rdt_move_group_tasks(), which updates tasks in a deleted rdtgroup. In both cases, use smp_mb() to order the task_struct::{closid,rmid} stores before the loads in task_curr(). In particular, in the rdt_move_group_tasks() case, simply execute an smp_mb() on every iteration with a matching task. It is possible to use a single smp_mb() in rdt_move_group_tasks(), but this would require two passes and a means of remembering which task_structs were updated in the first loop. However, benchmarking results below showed too little performance impact in the simple approach to justify implementing the two-pass approach. Times below were collected using `perf stat` to measure the time to remove a group containing a 1600-task, parallel workload. CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) # mkdir /sys/fs/resctrl/test # echo $$ > /sys/fs/resctrl/test/tasks # perf bench sched messaging -g 40 -l 100000 task-clock time ranges collected using: # perf stat rmdir /sys/fs/resctrl/test Baseline: 1.54 - 1.60 ms smp_mb() every matching task: 1.57 - 1.67 ms [ bp: Massage commit message. ] Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR") Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:42:05 +01:00
Reinette Chatre	22c4eeafc3	x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI [ Upstream commit e0ad6dc8969f790f14bddcfd7ea284b7e5f88a16 ] James reported in [1] that there could be two tasks running on the same CPU with task_struct->on_cpu set. Using task_struct->on_cpu as a test if a task is running on a CPU may thus match the old task for a CPU while the scheduler is running and IPI it unnecessarily. task_curr() is the correct helper to use. While doing so move the #ifdef check of the CONFIG_SMP symbol to be a C conditional used to determine if this helper should be used to ensure the code is always checked for correctness by the compiler. [1] https://lore.kernel.org/lkml/a782d2f3-d2f6-795f-f4b1-9462205fd581@arm.com Reported-by: James Morse <james.morse@arm.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/e9e68ce1441a73401e08b641cc3b9a3cf13fe6d4.1608243147.git.reinette.chatre@intel.com Stable-dep-of: fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:42:05 +01:00
Peter Zijlstra	a5b737623e	x86/boot: Avoid using Intel mnemonics in AT&T syntax asm commit 7c6dd961d0c8e7e8f9fdc65071fb09ece702e18d upstream. With 'GNU assembler (GNU Binutils for Debian) 2.39.90.20221231' the build now reports: arch/x86/realmode/rm/../../boot/bioscall.S: Assembler messages: arch/x86/realmode/rm/../../boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/realmode/rm/../../boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S: Assembler messages: arch/x86/boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant Which is due to: PR gas/29525 Note that with the dropped CMPSD and MOVSD Intel Syntax string insn templates taking operands, mixed IsString/non-IsString template groups (with memory operands) cannot occur anymore. With that maybe_adjust_templates() becomes unnecessary (and is hence being removed). More details: https://sourceware.org/bugzilla/show_bug.cgi?id=29525 Borislav Petkov further explains: " the particular problem here is is that the 'd' suffix is "conflicting" in the sense that you can have SSE mnemonics like movsD %xmm... and the same thing also for string ops (which is the case here) so apparently the agreement in binutils land is to use the always accepted suffixes 'l' or 'q' and phase out 'd' slowly... " Fixes: 7a734e7dd93b ("x86, setup: "glove box" BIOS calls -- infrastructure") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/Y71I3Ex2pvIxMpsP@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-18 11:42:03 +01:00
Rodrigo Branco	8cbd7f2643	x86/bugs: Flush IBP in ib_prctl_set() commit a664ec9158eeddd75121d39c9a0758016097fa96 upstream. We missed the window between the TIF flag update and the next reschedule. Signed-off-by: Rodrigo Branco <bsdaemon@google.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-18 11:41:59 +01:00
Sean Christopherson	0ebcfdc8c9	KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1 [ Upstream commit 31de69f4eea77b28a9724b3fa55aae104fc91fc7 ] Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the feature is supported in hardware and enabled in KVM's base, non-nested configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported. This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and obviously allows L1 to enable the feature for L2. KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when updating secondary controls in response to KVM_SET_CPUID(2), but (a) that depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction that the guest value must be a strict subset of the supported host value. Although no past commit explicitly enabled nested support for WAITPKG, doing so is safe and functionally correct from an architectural perspective as no additional KVM support is needed to virtualize TPAUSE, UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards VM-Exits to L1 as necessary (commit bf653b78f960, "KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit"). Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in hardware, i.e. always runs both L1 and L2 with the host's power management settings for TPAUSE and UMWAIT. See commit bf09fb6cba4f ("KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL") for more details. Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions") Cc: stable@vger.kernel.org Reported-by: Aaron Lewis <aaronlewis@google.com> Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20221213062306.667649-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:54 +01:00
Xiaoyao Li	e723bafd8f	KVM: VMX: Fix the spelling of CPU_BASED_USE_TSC_OFFSETTING [ Upstream commit 5e3d394fdd9e6b49cd8b28d85adff100a5bddc66 ] The mis-spelling is found by checkpatch.pl, so fix them. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 31de69f4eea7 ("KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:54 +01:00
Xiaoyao Li	7290669045	KVM: VMX: Rename NMI_PENDING to NMI_WINDOW [ Upstream commit 4e2a0bc56ad197e5ccfab8395649b681067fe8cb ] Rename the NMI-window exiting related definitions to match the latest Intel SDM. No functional changes. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 31de69f4eea7 ("KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:54 +01:00
Xiaoyao Li	da8ff59210	KVM: VMX: Rename INTERRUPT_PENDING to INTERRUPT_WINDOW [ Upstream commit 9dadc2f918df26e64aa04794cdb4d8667c934f47 ] Rename interrupt-windown exiting related definitions to match the latest Intel SDM. No functional changes. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 31de69f4eea7 ("KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:54 +01:00
Andrea Arcangeli	db99c8d6b7	KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers [ Upstream commit 4289d2728664fc1fb49cfc76a6a7d96d913b921f ] It's enough to check the exit value and issue a direct call to avoid the retpoline for all the common vmexit reasons. Of course CONFIG_RETPOLINE already forbids gcc to use indirect jumps while compiling all switch() statements, however switch() would still allow the compiler to bisect the case value. It's more efficient to prioritize the most frequent vmexits instead. The halt may be slow paths from the point of the guest, but not necessarily so from the point of the host if the host runs at full CPU capacity and no host CPU is ever left idle. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 31de69f4eea7 ("KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:54 +01:00
Andrea Arcangeli	2c82f134b9	KVM: x86: optimize more exit handlers in vmx.c [ Upstream commit f399e60c45f6b6e6ad6dfcedff1dd6386e086b0b ] Eliminate wasteful call/ret non RETPOLINE case and unnecessary fentry dynamic tracing hooking points. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 31de69f4eea7 ("KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:53 +01:00
Ashok Raj	f5489d5a24	x86/microcode/intel: Do not retry microcode reloading on the APs commit be1b670f61443aa5d0d01782e9b8ea0ee825d018 upstream. The retries in load_ucode_intel_ap() were in place to support systems with mixed steppings. Mixed steppings are no longer supported and there is only one microcode image at a time. Any retries will simply reattempt to apply the same image over and over without making progress. [ bp: Zap the circumstantial reasoning from the commit message. ] Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading") Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221129210832.107850-3-ashok.raj@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-18 11:41:48 +01:00
Eric W. Biederman	33933af45d	binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf [ Upstream commit e7f7785449a1f459a4a3ca92f82f56fb054dd2b9 ] In 2016 Linus moved install_exec_creds immediately after setup_new_exec, in binfmt_elf as a cleanup and as part of closing a potential information leak. Perform the same cleanup for the other binary formats. Different binary formats doing the same things the same way makes exec easier to reason about and easier to maintain. Greg Ungerer reports: > I tested the the whole series on non-MMU m68k and non-MMU arm > (exercising binfmt_flat) and it all tested out with no problems, > so for the binfmt_flat changes: Tested-by: Greg Ungerer <gerg@linux-m68k.org> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm") Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Stable-dep-of: e7f703ff2507 ("binfmt: Fix error return code in load_elf_fdpic_binary()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:41:46 +01:00
Xiu Jianfeng	29198f667f	x86/xen: Fix memory leak in xen_init_lock_cpu() [ Upstream commit ca84ce153d887b1dc8b118029976cc9faf2a9b40 ] In xen_init_lock_cpu(), the @name has allocated new string by kasprintf(), if bind_ipi_to_irqhandler() fails, it should be freed, otherwise may lead to a memory leak issue, fix it. Fixes: 2d9e1e2f58b5 ("xen: implement Xen-specific spinlocks") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20221123155858.11382-3-xiujianfeng@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:40:57 +01:00
Xiu Jianfeng	ec88254208	x86/xen: Fix memory leak in xen_smp_intr_init{_pv}() [ Upstream commit 69143f60868b3939ddc89289b29db593b647295e ] These local variables @{resched\|pmu\|callfunc...}_name saves the new string allocated by kasprintf(), and when bind_{v}ipi_to_irqhandler() fails, it goes to the @fail tag, and calls xen_smp_intr_free{_pv}() to free resource, however the new string is not saved, which cause a memory leak issue. fix it. Fixes: 9702785a747a ("i386: move xen") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20221123155858.11382-2-xiujianfeng@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:40:57 +01:00
Juergen Gross	6e98158d97	xen/events: only register debug interrupt for 2-level events [ Upstream commit d04b1ae5a9b0c868dda8b4b34175ef08f3cb9e93 ] xen_debug_interrupt() is specific to 2-level event handling. So don't register it with fifo event handling being active. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-4-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Stable-dep-of: 69143f60868b ("x86/xen: Fix memory leak in xen_smp_intr_init{_pv}()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:40:57 +01:00
Oleg Nesterov	314d510535	uprobes/x86: Allow to probe a NOP instruction with 0x66 prefix [ Upstream commit cefa72129e45313655d53a065b8055aaeb01a0c9 ] Intel ICC -hotpatch inserts 2-byte "0x66 0x90" NOP at the start of each function to reserve extra space for hot-patching, and currently it is not possible to probe these functions because branch_setup_xol_ops() wrongly rejects NOP with REP prefix as it treats them like word-sized branch instructions. Fixes: 250bbd12c2fe ("uprobes/x86: Refuse to attach uprobe to "word-sized" branch insns") Reported-by: Seiji Nishikawa <snishika@redhat.com> Suggested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20221204173933.GA31544@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:40:57 +01:00
Xiongfeng Wang	e293263248	perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox() [ Upstream commit 1ff9dd6e7071a561f803135c1d684b13c7a7d01d ] pci_get_device() will increase the reference count for the returned 'dev'. We need to call pci_dev_put() to decrease the reference count. Since 'dev' is only used in pci_read_config_dword(), let's add pci_dev_put() right after it. Fixes: 9d480158ee86 ("perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20221118063137.121512-3-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-18 11:40:54 +01:00
Paul E. McKenney	69d4f3baa6	x86/smpboot: Move rcu_cpu_starting() earlier commit 29368e09392123800e5e2bf0f3eda91f16972e52 upstream. The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough in the CPU-hotplug onlining process, which results in lockdep splats as follows: ============================= WARNING: suspicious RCU usage 5.9.0+ #268 Not tainted ----------------------------- kernel/kprobes.c:300 RCU-list traversed in non-reader section!! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 1, debug_locks = 1 no locks held by swapper/1/0. stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x77/0x97 __is_insn_slot_addr+0x15d/0x170 kernel_text_address+0xba/0xe0 ? get_stack_info+0x22/0xa0 __kernel_text_address+0x9/0x30 show_trace_log_lvl+0x17d/0x380 ? dump_stack+0x77/0x97 dump_stack+0x77/0x97 __lock_acquire+0xdf7/0x1bf0 lock_acquire+0x258/0x3d0 ? vprintk_emit+0x6d/0x2c0 _raw_spin_lock+0x27/0x40 ? vprintk_emit+0x6d/0x2c0 vprintk_emit+0x6d/0x2c0 printk+0x4d/0x69 start_secondary+0x1c/0x100 secondary_startup_64_no_verify+0xb8/0xbb This is avoided by moving the call to rcu_cpu_starting up near the beginning of the start_secondary() function. Note that the raw_smp_processor_id() is required in order to avoid calling into lockdep before RCU has declared the CPU to be watched for readers. Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/ Reported-by: Qian Cai <cai@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-12-19 12:24:15 +01:00
Michael Kelley	e67e119adf	x86/ioremap: Fix page aligned size calculation in __ioremap_caller() [ Upstream commit 4dbd6a3e90e03130973688fd79e19425f720d999 ] Current code re-calculates the size after aligning the starting and ending physical addresses on a page boundary. But the re-calculation also embeds the masking of high order bits that exceed the size of the physical address space (via PHYSICAL_PAGE_MASK). If the masking removes any high order bits, the size calculation results in a huge value that is likely to immediately fail. Fix this by re-calculating the page-aligned size first. Then mask any high order bits using PHYSICAL_PAGE_MASK. Fixes: ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/1668624097-14884-2-git-send-email-mikelley@microsoft.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-12-08 11:23:06 +01:00
Pawan Gupta	b5041a3daa	x86/pm: Add enumeration check before spec MSRs save/restore setup commit 50bcceb7724e471d9b591803889df45dcbb584bc upstream. pm_save_spec_msr() keeps a list of all the MSRs which _might_ need to be saved and restored at hibernate and resume. However, it has zero awareness of CPU support for these MSRs. It mostly works by unconditionally attempting to manipulate these MSRs and relying on rdmsrl_safe() being able to handle a #GP on CPUs where the support is unavailable. However, it's possible for reads (RDMSR) to be supported for a given MSR while writes (WRMSR) are not. In this case, msr_build_context() sees a successful read (RDMSR) and marks the MSR as valid. Then, later, a write (WRMSR) fails, producing a nasty (but harmless) error message. This causes restore_processor_state() to try and restore it, but writing this MSR is not allowed on the Intel Atom N2600 leading to: unchecked MSR access error: WRMSR to 0x122 (tried to write 0x0000000000000002) \ at rIP: 0xffffffff8b07a574 (native_write_msr+0x4/0x20) Call Trace: <TASK> restore_processor_state x86_acpi_suspend_lowlevel acpi_suspend_enter suspend_devices_and_enter pm_suspend.cold state_store kernfs_fop_write_iter vfs_write ksys_write do_syscall_64 ? do_syscall_64 ? up_read ? lock_is_held_type ? asm_exc_page_fault ? lockdep_hardirqs_on entry_SYSCALL_64_after_hwframe To fix this, add the corresponding X86_FEATURE bit for each MSR. Avoid trying to manipulate the MSR when the feature bit is clear. This required adding a X86_FEATURE bit for MSRs that do not have one already, but it's a small price to pay. [ bp: Move struct msr_enumeration inside the only function that uses it. ] [Pawan: Resolve build issue in backport] Fixes: 73924ec4d560 ("x86/pm: Save the MSR validity status at context setup") Reported-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/c24db75d69df6e66c0465e13676ad3f2837a2ed8.1668539735.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-12-08 11:23:05 +01:00
Pawan Gupta	3b28594576	x86/tsx: Add a feature bit for TSX control MSR support commit aaa65d17eec372c6a9756833f3964ba05b05ea14 upstream. Support for the TSX control MSR is enumerated in MSR_IA32_ARCH_CAPABILITIES. This is different from how other CPU features are enumerated i.e. via CPUID. Currently, a call to tsx_ctrl_is_supported() is required for enumerating the feature. In the absence of a feature bit for TSX control, any code that relies on checking feature bits directly will not work. In preparation for adding a feature bit check in MSR save/restore during suspend/resume, set a new feature bit X86_FEATURE_TSX_CTRL when MSR_IA32_TSX_CTRL is present. [ bp: Remove tsx_ctrl_is_supported()] [Pawan: Resolved conflicts in backport; Removed parts of commit message referring to removed function tsx_ctrl_is_supported()] Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/de619764e1d98afbb7a5fa58424f1278ede37b45.1668539735.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-12-08 11:23:05 +01:00

1 2 3 4 5 ...

34118 Commits