IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 9b8493dc43044376716d789d07699f17d538a7c4 upstream.
Commit in Fixes added an AMD-specific microcode callback. However, it
didn't check the CPU vendor the kernel runs on explicitly.
The only reason the Zenbleed check in it didn't run on other x86 vendors
hardware was pure coincidental luck:
if (!cpu_has_amd_erratum(c, amd_zenbleed))
return;
gives true on other vendors because they don't have those families and
models.
However, with the removal of the cpu_has_amd_erratum() in
05f5f73936fa ("x86/CPU/AMD: Drop now unused CPU erratum checking function")
that coincidental condition is gone, leading to the zenbleed check
getting executed on other vendors too.
Add the explicit vendor check for the whole callback as it should've
been done in the first place.
Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix")
Cc: <stable@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231201184226.16749-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 27d25348d42161837be08fc63b04a2559d2e781c ]
A write-access violation page fault kernel crash was observed while running
cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was
observed during hotplug, after the CPU was offlined and the process
was migrated to different CPU. setup_ghcb() is called again which
tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this
is a read_only variable which is initialised during booting.
Trying to write it results in a pagefault:
BUG: unable to handle page fault for address: ffffffffba556e70
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
[ ...]
Call Trace:
<TASK>
? __die_body.cold+0x1a/0x1f
? __die+0x2a/0x35
? page_fault_oops+0x10c/0x270
? setup_ghcb+0x71/0x100
? __x86_return_thunk+0x5/0x6
? search_exception_tables+0x60/0x70
? __x86_return_thunk+0x5/0x6
? fixup_exception+0x27/0x320
? kernelmode_fixup_or_oops+0xa2/0x120
? __bad_area_nosemaphore+0x16a/0x1b0
? kernel_exc_vmm_communication+0x60/0xb0
? bad_area_nosemaphore+0x16/0x20
? do_kern_addr_fault+0x7a/0x90
? exc_page_fault+0xbd/0x160
? asm_exc_page_fault+0x27/0x30
? setup_ghcb+0x71/0x100
? setup_ghcb+0xe/0x100
cpu_init_exception_handling+0x1b9/0x1f0
The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase,
and it only needs to be done once in any case.
[ mingo: Refined the changelog. ]
Fixes: 95d33bfaa3e1 ("x86/sev: Register GHCB memory when SEV-SNP is active")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ upstream commit be5341eb0d43b1e754799498bd2e8756cc167a41 ]
There is no real reason to have a separate ASM entry point implementation
for the legacy INT 0x80 syscall emulation on 64-bit.
IDTENTRY provides all the functionality needed with the only difference
that it does not:
- save the syscall number (AX) into pt_regs::orig_ax
- set pt_regs::ax to -ENOSYS
Both can be done safely in the C code of an IDTENTRY before invoking any of
the syscall related functions which depend on this convention.
Aside of ASM code reduction this prepares for detecting and handling a
local APIC injected vector 0x80.
[ kirill.shutemov: More verbose comments ]
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee545b94d39a00c93dc98b1dbcbcf731d2eadeb4 upstream.
Hygon processors with a model ID > 3 have CPUID leaf 0xB correctly
populated and don't need the fixed package ID shift workaround. The fixup
is also incorrect when running in a guest.
Fixes: e0ceeae708ce ("x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processors")
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/tencent_594804A808BD93A4EBF50A994F228E3A7F07@qq.com
Link: https://lore.kernel.org/r/20230814085112.089607918@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b56ebe7c896dc78b5865ec2c4b1dae3c93537517 upstream.
commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less
convoluted"), reworked the code so that the x86 specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is not longer activated if
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows ot remove the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the momentary non-existing case that the reservation mode would be
not set for a maskable MSI interrupt this would not cause any harm as it
just would cause msi_set_affinity() to go needlessly through the
functionaly equivalent slow path, which works perfectly fine with maskable
interrupts as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 31255e072b2e91f97645d792d25b2db744186dd1 upstream.
When a signal is being delivered, the kernel needs to make accesses to
userspace. These accesses could encounter an access error, in which case
the signal delivery itself will trigger a segfault. Usually this would
result in the kernel killing the process. But in the case of a SEGV signal
handler being configured, the failure of the first signal delivery will
result in *another* signal getting delivered. The second signal may
succeed if another thread has resolved the issue that triggered the
segfault (i.e. a well timed mprotect()/mmap()), or the second signal is
being delivered to another stack (i.e. an alt stack).
On x86, in the non-shadow stack case, all the accesses to userspace are
done before changes to the registers (in pt_regs). The operation is
aborted when an access error occurs, so although there may be writes done
for the first signal, control flow changes for the signal (regs->ip,
regs->sp, etc) are not committed until all the accesses have already
completed successfully. This means that the second signal will be
delivered as if it happened at the time of the first signal. It will
effectively replace the first aborted signal, overwriting the half-written
frame of the aborted signal. So on sigreturn from the second signal,
control flow will resume happily from the point of control flow where the
original signal was delivered.
The problem is, when shadow stack is active, the shadow stack SSP
register/MSR is updated *before* some of the userspace accesses. This
means if the earlier accesses succeed and the later ones fail, the second
signal will not be delivered at the same spot on the shadow stack as the
first one. So on sigreturn from the second signal, the SSP will be
pointing to the wrong location on the shadow stack (off by a frame).
Pengfei privately reported that while using a shadow stack enabled glibc,
the “signal06” test in the LTP test-suite hung. It turns out it is
testing the above described double signal scenario. When this test was
compiled with shadow stack, the first signal pushed a shadow stack
sigframe, then the second pushed another. When the second signal was
handled, the SSP was at the first shadow stack signal frame instead of
the original location. The test then got stuck as the #CP from the twice
incremented SSP was incorrect and generated segfaults in a loop.
Fix this by adjusting the SSP register only after any userspace accesses,
such that there can be no failures after the SSP is adjusted. Do this by
moving the shadow stack sigframe push logic to happen after all other
userspace accesses.
Note, sigreturn (as opposed to the signal delivery dealt with in this
patch) has ordering behavior that could lead to similar failures. The
ordering issues there extend beyond shadow stack to include the alt stack
restoration. Fixing that would require cross-arch changes, and the
ordering today does not cause any known test or apps breakages. So leave
it as is, for now.
[ dhansen: minor changelog/subject tweak ]
Fixes: 05e36022c054 ("x86/shstk: Handle signals for shadow stack")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20231107182251.91276-1-rick.p.edgecombe%40intel.com
Link: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/signal/signal06.c
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2a565258b3f4bbdc7a3c09cd02082cb286a7bffc upstream.
Three PCI IDs for DF Function 4 were defined but not used.
Add them to the "link" list.
Fixes: f8faf3496633 ("x86/amd_nb: Add AMD PCI IDs for SMN communication")
Fixes: 23a5b8bb022c ("x86/amd_nb: Add PCI ID for family 19h model 78h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230803150430.3542854-1-yazen.ghannam@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f44075ecafb726830e63d33fbca29413149eeeb8 ]
The ->idt_seq and ->recv_jiffies variables added by:
1a3ea611fc10 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
... place the exit-time check of the bottom bit of ->idt_seq after the
this_cpu_dec_return() that re-enables NMI nesting. This can result in
the following sequence of events on a given CPU in kernels built with
CONFIG_NMI_CHECK_CPU=y:
o An NMI arrives, and ->idt_seq is incremented to an odd number.
In addition, nmi_state is set to NMI_EXECUTING==1.
o The NMI is processed.
o The this_cpu_dec_return(nmi_state) zeroes nmi_state and returns
NMI_EXECUTING==1, thus opting out of the "goto nmi_restart".
o Another NMI arrives and ->idt_seq is incremented to an even
number, triggering the warning. But all is just fine, at least
assuming we don't get so many closely spaced NMIs that the stack
overflows or some such.
Experience on the fleet indicates that the MTBF of this false positive
is about 70 years. Or, for those who are not quite that patient, the
MTBF appears to be about one per week per 4,000 systems.
Fix this false-positive warning by moving the "nmi_restart" label before
the initial ->idt_seq increment/check and moving the this_cpu_dec_return()
to follow the final ->idt_seq increment/check. This way, all nested NMIs
that get past the NMI_NOT_RUNNING check get a clean ->idt_seq slate.
And if they don't get past that check, they will set nmi_state to
NMI_LATCHED, which will cause the this_cpu_dec_return(nmi_state)
to restart.
Fixes: 1a3ea611fc10 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/0cbff831-6e3d-431c-9830-ee65ee7787ff@paulmck-laptop
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 965e05ff8af98c44f9937366715c512000373164 ]
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.
This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.
This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So cpu_smt_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.
This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the core CPU hot-plug core to
refuse to bring up the APs.
This happens because cpu_smt_control is set to CPU_SMT_NOT_SUPPORTED which
causes cpu_bootable() to evaluate the unpopulated primary thread mask with
the conclusion that all non-boot CPUs are not valid to be plugged.
The core code has already been made more robust against this kind of fail,
but the primary thread mask really wants to be populated to avoid other
issues all over the place.
Just fake the mask by pretending that all XEN/PV vCPUs are primary threads,
which is consistent because all of XEN/PVs topology is fake or non-existent.
Fixes: 6a4d2657e048 ("x86/smp: Provide topology_is_primary_thread()")
Fixes: f54d4434c281 ("x86/apic: Provide cpu_primary_thread mask")
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.210011520@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 001470fed5959d01faecbd57fcf2f60294da0de1 ]
Since the size value is added to the base address to yield the last valid
byte address of the GDT, the current size value of startup_gdt_descr is
incorrect (too large by one), fix it.
[ mingo: This probably never mattered, because startup_gdt[] is only used
in a very controlled fashion - but make it consistent nevertheless. ]
Fixes: 866b556efa12 ("x86/head/64: Install startup GDT")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807084547.217390-1-ytcoode@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dc6306ad5b0dda040baf1fde3cfd458e6abfc4da ]
The SRSO default safe-ret mitigation is reported as "mitigated" even if
microcode hasn't been updated. That's wrong because userspace may still
be vulnerable to SRSO attacks due to IBPB not flushing branch type
predictions.
Report the safe-ret + !microcode case as vulnerable.
Also report the microcode-only case as vulnerable as it leaves the
kernel open to attacks.
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/a8a14f97d1b0e03ec255c81637afdf4cf0ae9c99.1693889988.git.jpoimboe@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d1142ac51307145dbb256ac3535a1d43a1c9800 ]
Make the SBPB check more robust against the (possible) case where future
HW has SRSO fixed but doesn't have the SRSO_NO bit set.
Fixes: 1b5277c0ea0b ("x86/srso: Add SRSO_NO support")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/cee5050db750b391c9f35f5334f8ff40e66c01b9.1693889988.git.jpoimboe@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
- Fix a possible CPU hotplug deadlock bug caused by the new
TSC synchronization code.
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keybards, etc.
- Add a new Intel CPU model number to <asm/intel-family.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU85uARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jSrhAAr3FrzkDWqLUUuDMWFEEVkTGz9GLc2uMP
bGT7HApp301wGpFv35bbBVp7HKLSB9nFeLsWKQF6MBa2uSUiCxSBcSNfHVJ6jX1O
Ac4hgl0Q/tZl0yZoag3u4j3FS57d9Qkzpg0QLTz9MVnQXEPTcDkli9tfSbRpxx3B
WvcGwAD+VeMjeghIJHxTJW9KwouIf3gAzqoKE7TKEozJ4S2LbNs3nEsDGsRHJsPN
jy/Fxlny6pdlkJArP/ILA9WO3yqBn2doIN9oqWKQCQtYbt1USnuJb5oCOXYAOijG
A9iLJmMjDyN2TZ8lKYHBcHVmEVTeB+6VnikejcAaT4EMrArj5mdjL3K7McZ24NFh
s8mx/S+v4mBExwNsaoq8kuUEZySWxNbFPLBjOq8BKIZYYDQUp1NK58CRWv9BAB6y
GjgYlMYI6EGDb+QAoTKZ5KNqrwtUPuk2ijjJTB15DpgLkl7XmyoVOulrtVDrhuGZ
Uuge9h0CLKnqD7lLfJdI7NLYxmZxPiQKbL9vrIqlk989UM5ItvttxWBNzhAdSTsG
VIaXAOJgGDj74dJC+3b0umugKJcrycUFBy5/RxOg+OcnSPd5g/L4XWcLLk+OElE3
pFBetWmO2HU4pxhL1GdWsDuF+RkYug4XKQqQP2LJ8Zepff/jkF+hVJoLYUhQIF6A
OMQv1N6/nSg=
=k19e
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a possible CPU hotplug deadlock bug caused by the new TSC
synchronization code
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keybards, etc
- Add a new Intel CPU model number to <asm/intel-family.h>
* tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Defer marking TSC unstable to a worker
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
Tetsuo reported the following lockdep splat when the TSC synchronization
fails during CPU hotplug:
tsc: Marking TSC unstable due to check_tsc_sync_source failed
WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90
Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);
stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90
The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().
clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario,
Before that change the synchronization function was invoked in thread
context so this could not happen.
As it is not crucical whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.
Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
David and a few others reported that on certain newer systems some legacy
interrupts fail to work correctly.
Debugging revealed that the BIOS of these systems leaves the legacy PIC in
uninitialized state which makes the PIC detection fail and the kernel
switches to a dummy implementation.
Unfortunately this fallback causes quite some code to fail as it depends on
checks for the number of legacy PIC interrupts or the availability of the
real PIC.
In theory there is no reason to use the PIC on any modern system when
IO/APIC is available, but the dependencies on the related checks cannot be
resolved trivially and on short notice. This needs lots of analysis and
rework.
The PIC detection has been added to avoid quirky checks and force selection
of the dummy implementation all over the place, especially in VM guest
scenarios. So it's not an option to revert the relevant commit as that
would break a lot of other scenarios.
One solution would be to try to initialize the PIC on detection fail and
retry the detection, but that puts the burden on everything which does not
have a PIC.
Fortunately the ACPI/MADT table header has a flag field, which advertises
in bit 0 that the system is PCAT compatible, which means it has a legacy
8259 PIC.
Evaluate that bit and if set avoid the detection routine and keep the real
PIC installed, which then gets initialized (for nothing) and makes the rest
of the code with all the dependencies work again.
Fixes: e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately")
Reported-by: David Lazar <dlazar@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Lazar <dlazar@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003
Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
the guest kernel gets to emulate certain instructions in SEV-{ES,SNP}
guests by:
- disabling emulation of MMIO instructions when coming from user mode
- checking the IO permission bitmap before emulating IO instructions and
verifying the memory operands of INS/OUTS insns.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmUufCwACgkQEsHwGGHe
VUoTHA//YO81VH8JkvfKwxh322mbD+TDTkgWgcpClsWnkIZQdyCpKVTwsWWuhwX5
FmCEc3I75hRK3ts3sdhZYOS94gKVUyWf2ERm2qMD02+08tS3K/TxJyx5xBMz9U03
VOiWRC1rp33MZ0eCrXenTbA7Xay6AhU34pz4qSdEvkUKUU6YIdCfnspFXSi84Uqy
tgmyPDJhSH/3hE46EJSHd4m6c8PO3Su/oUJHMy/refbxAscf9NNdWpGlPY285Aox
RTA0mOYQRRKf0YFkGabLY9IIcL0w+NXMhMVEMFNiXyxFvaM8CONhK6SDmzvcUngB
gOfsN6nD4JDqfH11gXCdxS3n0IZuAAMHyEigktvp1qnyNEDTBUtbfUkyqvITg+JC
u3KMFSSYB58colTK/bkhE0IHnH2bKzhkDuVKzmJn/OCTxf0xxfGsnjbdw0JxMO81
/9ORx8/QKWzv411AH2DUNh4vIJqDxVTJJb8zkScnYStX2ust6Ra+jYIr+mmf46md
+Rzo5qoe/GnAtReCdGFg3w339nEbUz51n5uqm9KN4QnH39wg5R8nPiAUMHOlO1Zm
PNvNgSZUkiiJpMci/KBbyFzPJTO7YjjRql7GWRwhWrclSPOrq49kocK5eIEYS4ol
cd5cKF92hHsnwycz2dZsDQwYqEQ5J+c6kZTwfUwJcoUBxCWP/qI=
=MNCv
-----END PGP SIGNATURE-----
Merge tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"Take care of a race between when the #VC exception is raised and when
the guest kernel gets to emulate certain instructions in SEV-{ES,SNP}
guests by:
- disabling emulation of MMIO instructions when coming from user mode
- checking the IO permission bitmap before emulating IO instructions
and verifying the memory operands of INS/OUTS insns"
* tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Check for user-space IOIO pointing to kernel space
x86/sev: Check IOBM for IOIO exceptions from user-space
x86/sev: Disable MMIO emulation from user mode
Check the memory operand of INS/OUTS before emulating the instruction.
The #VC exception can get raised from user-space, but the memory operand
can be manipulated to access kernel memory before the emulation actually
begins and after the exception handler has run.
[ bp: Massage commit message. ]
Fixes: 597cfe48212a ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
- Fix the handling of the phycal timer offset when FEAT_ECV
and CNTPOFF_EL2 are implemented.
- Restore the functionnality of Permission Indirection that
was broken by the Fine Grained Trapping rework
- Cleanup some PMU event sharing code
MIPS:
- Fix W=1 build.
s390:
- One small fix for gisa to avoid stalls.
x86:
- Truncate writes to PMU counters to the counter's width to avoid spurious
overflows when emulating counter events in software.
- Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
architectural behavior).
- Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
kick the guest out of emulated halt.
- Fix for loading XSAVE state from an old kernel into a new one.
- Fixes for AMD AVIC
selftests:
- Play nice with %llx when formatting guest printf and assert statements.
- Clean up stale test metadata.
- Zero-initialize structures in memslot perf test to workaround a suspected
"may be used uninitialized" false positives from GCC.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUtvXgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE3gf/Q0Xvi/oU/+iDMuvfCbMZg/nhbrsa
WmE/zXLrCF0DknppAsWulkhLGL2ceL6X+f37f2vWpBdG9SVDG/vSAg+QQDwsXiKN
hTJoaybtMMPZM64emPZKeLMrq3A/U32qIMmWMJkoQRyz6dftUhGqZEuy1jw8oomJ
n9idRDCMkbo+bick4URt0FEuI3Tf6dPIlG7P5hObFTw+nenzzxTjoxWZ432Mgyod
yqveEke4hcQ+6K1zdAcDNZqT9ZhxeTxAO4yrBAYfnFoPLhUXKDUumkqAQPNOhKTo
YN+b29kHBm+HvYkHN785FQla/13wjE1aq5TUj5J7NEDv4uRXDefDq2OAeg==
=b9AY
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the handling of the phycal timer offset when FEAT_ECV and
CNTPOFF_EL2 are implemented
- Restore the functionnality of Permission Indirection that was
broken by the Fine Grained Trapping rework
- Cleanup some PMU event sharing code
MIPS:
- Fix W=1 build
s390:
- One small fix for gisa to avoid stalls
x86:
- Truncate writes to PMU counters to the counter's width to avoid
spurious overflows when emulating counter events in software
- Set the LVTPC entry mask bit when handling a PMI (to match
Intel-defined architectural behavior)
- Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work
to kick the guest out of emulated halt
- Fix for loading XSAVE state from an old kernel into a new one
- Fixes for AMD AVIC
selftests:
- Play nice with %llx when formatting guest printf and assert
statements
- Clean up stale test metadata
- Zero-initialize structures in memslot perf test to workaround a
suspected 'may be used uninitialized' false positives from GCC"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2
KVM: arm64: POR{E0}_EL1 do not need trap handlers
KVM: arm64: Add nPIR{E0}_EL1 to HFG traps
KVM: MIPS: fix -Wunused-but-set-variable warning
KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events
KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
x86: KVM: SVM: add support for Invalid IPI Vector interception
x86: KVM: SVM: always update the x2avic msr interception
KVM: selftests: Force load all supported XSAVE state in state test
KVM: selftests: Load XSAVE state into untouched vCPU during state test
KVM: selftests: Touch relevant XSAVE state in guest for state test
KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
KVM: selftests: Zero-initialize entire test_result in memslot perf test
KVM: selftests: Remove obsolete and incorrect test case metadata
KVM: selftests: Treat %llx like %lx when formatting guest printf
KVM: x86/pmu: Synthesize at most one PMI per VM-exit
KVM: x86: Mask LVTPC when handling a PMI
KVM: x86/pmu: Truncate counter value to allowed width on write
...
This reverts commit 45e34c8af58f23db4474e2bfe79183efec09a18b, and the
two subsequent fixes to it:
3f874c9b2aae ("x86/smp: Don't send INIT to non-present and non-booted CPUs")
b1472a60a584 ("x86/smp: Don't send INIT to boot CPU")
because it seems to result in hung machines at shutdown. Particularly
some Dell machines, but Thomas says
"The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs -
at least that's what I could figure out from the various bug reports.
I don't know which CPUs the DELL machines have, so I can't say it's a
pattern.
I agree with the revert for now"
Ashok Raj chimes in:
"There was a report (probably this same one), and it turns out it was a
bug in the BIOS SMI handler.
The client BIOS's were waiting for the lowest APICID to be the SMI
rendevous master. If this is MeteorLake, the BSP wasn't the one with
the lowest APIC and it triped here.
The BIOS change is also being pushed to others for assimilation :)
Server BIOS's had this correctly for a while now"
and it does look likely to be some bad interaction between SMI and the
non-BSP cores having put into INIT (and thus unresponsive until reset).
Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
Link: https://forum.artixlinux.org/index.php/topic,5997.0.html
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prototypes between architectures.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrobMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ggUw/7BGHe360tsrsMAcOHcwvvGhnQ0UKuoqLs
IJl3dfsdX+JnL7cpbNcRBVDqgH2seIwdQFa73gALColcxntEBbnC/gVS4QLLSxSp
HIq5C1OELT/jPMOjc6aimJx/qPvW/CLgo2WJx78rv0ykkf1RJIzqCTVKf8VQX6Vu
t0/9jEhBNuL8DZthJ5ZV448WtlJcdnWXVGxq/UHEheV219Rjtp3NGf8s/K+WMzF3
x9Zhmb+/UPgjhaZtrQDP2mf7ZYgmVhLvJTRSQdQNrcDe/ZaNrCrEGOwHuOpQ0vXw
v42rd1AVGV/xgIhfBOABLdb5snBbQMDvYLcma04bkBd6H6WPFJZ1PvnGovTagxUO
FP4117VBA5ARwZemxwGEPJkNF9lVEPSBVDv7bx2OO0zVCViuAbKJXxGzCW/GiSGA
BRk5FogxJ7TcjWsYWWaZfYlq8RFI5UI3K/IEQIUpQKtC9OMhScdp532xEP48dc8u
pHjPjVoYCXtoD/ZD73ZJJduN6Hn30HEE/IJ6+YKJRo4EZquUrGtbgG46QbFBqxqW
xIPPTx7OPAaAfq020+c6BuUnta1iEY6I4De/+XbRdQf+AIWqfLsWNI8en8aO4t5p
rtlrYD2Si1F0KBWtDWR7JhCm8CD2klWVWrD9d4DLpz9ljHXKa9d7BYp3BxkvgcQm
x8f1D9yC9X4=
=20Na
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
"Fix a Longsoon build warning by harmonizing the
arch_[un]register_cpu() prototypes between architectures"
* tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu-hotplug: Provide prototypes for arch CPU registration
Fei has reported that KASAN triggers during apply_alternatives() on
a 5-level paging machine:
BUG: KASAN: out-of-bounds in rcu_is_watching()
Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0
...
__asan_load4()
rcu_is_watching()
trace_hardirqs_on()
text_poke_early()
apply_alternatives()
...
On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57)
gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on
__VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled().
KASAN gets confused when apply_alternatives() patches the
KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START
static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue.
Fix it for real by disabling KASAN while the kernel is patching alternatives.
[ mingo: updated the changelog ]
Fixes: 6657fca06e3f ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y")
Reported-by: Fei Yang <fei.yang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com
Mask off xfeatures that aren't exposed to the guest only when saving guest
state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly.
Preserving the maximal set of xfeatures in user_xfeatures restores KVM's
ABI for KVM_SET_XSAVE, which prior to commit ad856280ddea ("x86/kvm/fpu:
Limit guest user_xfeatures to supported bits of XCR0") allowed userspace
to load xfeatures that are supported by the host, irrespective of what
xfeatures are exposed to the guest.
There is no known use case where userspace *intentionally* loads xfeatures
that aren't exposed to the guest, but the bug fixed by commit ad856280ddea
was specifically that KVM_GET_SAVE{2} would save xfeatures that weren't
exposed to the guest, e.g. would lead to userspace unintentionally loading
guest-unsupported xfeatures when live migrating a VM.
Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially
problematic for QEMU-based setups, as QEMU has a bug where instead of
terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops
loading guest state, i.e. resumes the guest after live migration with
incomplete guest state, and ultimately results in guest data corruption.
Note, letting userspace restore all host-supported xfeatures does not fix
setups where a VM is migrated from a host *without* commit ad856280ddea,
to a target with a subset of host-supported xfeatures. However there is
no way to safely address that scenario, e.g. KVM could silently drop the
unsupported features, but that would be a clear violation of KVM's ABI and
so would require userspace to opt-in, at which point userspace could
simply be updated to sanitize the to-be-loaded XSAVE state.
Reported-by: Tyler Stachecki <stachecki.tyler@gmail.com>
Closes: https://lore.kernel.org/all/20230914010003.358162-1-tstachecki@bloomberg.net
Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Message-Id: <20230928001956.924301-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can
constrain which xfeatures are saved into the userspace buffer without
having to modify the user_xfeatures field in KVM's guest_fpu state.
KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to
guest must not show up in the effective xstate_bv field of the buffer.
Saving only the guest-supported xfeatures allows userspace to load the
saved state on a different host with a fewer xfeatures, so long as the
target host supports the xfeatures that are exposed to the guest.
KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to
the set of guest-supported xfeatures, but doing so broke KVM's historical
ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that
are supported by the *host*.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928001956.924301-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide common prototypes for arch_register_cpu() and
arch_unregister_cpu(). These are called by acpi_processor.c, with weak
versions, so the prototype for this is already set. It is generally not
necessary for function prototypes to be conditional on preprocessor macros.
Some architectures (e.g. Loongarch) are missing the prototype for this, and
rather than add it to Loongarch's asm/cpu.h, do the job once for everyone.
Since this covers everyone, remove the now unnecessary prototypes in
asm/cpu.h, and therefore remove the 'static' from one of ia64's
arch_register_cpu() definitions.
[ tglx: Bring back the ia64 part and remove the ACPI prototypes ]
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.
Reported-by: René Rebe <rene@exactcode.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: René Rebe <rene@exactcode.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
Check the IO permission bitmap (if present) before emulating IOIO #VC
exceptions for user-space. These permissions are checked by hardware
already before the #VC is raised, but due to the VC-handler decoding
race it needs to be checked again in software.
Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
A virt scenario can be constructed where MMIO memory can be user memory.
When that happens, a race condition opens between when the hardware
raises the #VC and when the #VC handler gets to emulate the instruction.
If the MOVS is replaced with a MOVS accessing kernel memory in that
small race window, then write to kernel memory happens as the access
checks are not done at emulation time.
Disable MMIO emulation in user mode temporarily until a sensible use
case appears and justifies properly handling the race window.
Fixes: 0118b604c2c9 ("x86/sev-es: Handle MMIO String Instructions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
The kernel test robot reported kernel-doc warnings here:
monitor.c:34: warning: Cannot understand * @rmid_free_lru A least recently used list of free RMIDs on line 34 - I thought it was a doc line
monitor.c:41: warning: Cannot understand * @rmid_limbo_count count of currently unused but (potentially) on line 41 - I thought it was a doc line
monitor.c:50: warning: Cannot understand * @rmid_entry - The entry in the limbo and free lists. on line 50 - I thought it was a doc line
We don't have a syntax for documenting individual data items via
kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen
for consistency.
Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org
In snp_accept_memory(), the npages variables value is calculated from
phys_addr_t variables but is an unsigned int. A very large range passed
into snp_accept_memory() could lead to truncating npages to zero. This
doesn't happen at the moment but let's be prepared.
Fixes: 6c3211796326 ("x86/sev: Add SNP-specific unaccepted memory support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/6d511c25576494f682063c9fb6c705b526a3757e.1687441505.git.thomas.lendacky@amd.com
SNP retrieves the majority of CPUID information from the SNP CPUID page.
But there are times when that information needs to be supplemented by the
hypervisor, for example, obtaining the initial APIC ID of the vCPU from
leaf 1.
The current implementation uses the MSR protocol to retrieve the data from
the hypervisor, even when a GHCB exists. The problem arises when an NMI
arrives on return from the VMGEXIT. The NMI will be immediately serviced
and may generate a #VC requiring communication with the hypervisor.
Since a GHCB exists in this case, it will be used. As part of using the
GHCB, the #VC handler will write the GHCB physical address into the GHCB
MSR and the #VC will be handled.
When the NMI completes, processing resumes at the site of the VMGEXIT
which is expecting to read the GHCB MSR and find a CPUID MSR protocol
response. Since the NMI handling overwrote the GHCB MSR response, the
guest will see an invalid reply from the hypervisor and self-terminate.
Fix this problem by using the GHCB when it is available. Any NMI
received is properly handled because the GHCB contents are copied into
a backup page and restored on NMI exit, thus preserving the active GHCB
request or result.
[ bp: Touchups. ]
Fixes: ee0bfa08a345 ("x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/a5856fa1ebe3879de91a8f6298b6bbd901c61881.1690578565.git.thomas.lendacky@amd.com
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an
enclave and set secs.epc_page to NULL. The SECS page is used for EAUG
and ELDU in the SGX page fault handler. However, the NULL check for
secs.epc_page is only done for ELDU, not EAUG before being used.
Fix this by doing the same NULL check and reloading of the SECS page as
needed for both EAUG and ELDU.
The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.
An enclave can not run nor generate page faults without a resident SECS
page. But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.
Hitting this bug requires that race triggered with a #PF for EAUG.
Following is a trace when it happens.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:sgx_encl_eaug_page+0xc7/0x210
Call Trace:
? __kmem_cache_alloc_node+0x16a/0x440
? xa_load+0x6e/0xa0
sgx_vma_fault+0x119/0x230
__do_fault+0x36/0x140
do_fault+0x12f/0x400
__handle_mm_fault+0x728/0x1110
handle_mm_fault+0x105/0x310
do_user_addr_fault+0x1ee/0x750
? __this_cpu_preempt_check+0x13/0x20
exc_page_fault+0x76/0x180
asm_exc_page_fault+0x27/0x30
Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com
When compiled with W=1, the following warning is generated:
arch/x86/kernel/kgdb.c:698: warning: Cannot understand *
on line 698 - I thought it was a doc line
Remove the corresponding empty comment line to fix the warning.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/aad659537c1d4ebd86912a6f0be458676c8e69af.1695401178.git.christophe.jaillet@wanadoo.fr
- Fix a kexec bug,
- Fix an UML build bug,
- Fix a handful of SRSO related bugs,
- Fix a shadow stacks handling bug & robustify related code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUNbQIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jVIg/9EChW7qFTda8joR41Uayg07VIOpGirDLu
7hjzOnt4Ni93cfFNUBkKDKXoWxGAiOD+cRDnT6+zsJAvAZR26Y3UNVLYlAy+lFKK
9kRxeDM7nOEKqCC+zneinMFcIKmZRttMLpj8O901jB2S08x4UarnNx5SaWEcqYbn
rf1XIEuEvlxqMfafNueS/TRadV52qVU8Y+2+inkDnM7dDNwt+rCs5tQ9ebJos8QO
tsAoQes1G+0mjPrpqAgsIic5e3QCHliwVr8ASQrKZ9KR+fokEJRbSBNjgHUCNeVN
0bHBhuDJBSniC7QmAQGEizrZWmHl2HxwYYnCE0gd0g24b7sDTwWBFSXWratCrPdX
e4qYq36xonWdQcbpVF8ijMXF/R810vDyis/yc/BTeo5EBWf6aM/cx1dmS9DUxRpA
QsIW2amSCPVYwYE5MAC+K/DM9nq1tk8YnKi4Mob3L28+W3CmVwSwT9S86z2QLlZu
+KgVV4yBtJY1FJqVur5w3awhFtqLfBdfIvs6uQCd9VZXVPbBfS8+rOQmmhFixEDu
FSPlAChmXYTAM2R+UxcEvw1Zckrtd2BCOa8UrY2lq57mduBK1EymdpfjlrUAChLG
x7fQBOGNgOTLwYcsurIdS5jAqiudpnJ/KDG8ZAmKsVoW96JCPp9B3tVZMp9tT30C
8HRwSPX4384=
=58St
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a kexec bug
- Fix an UML build bug
- Fix a handful of SRSO related bugs
- Fix a shadow stacks handling bug & robustify related code
* tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/shstk: Add warning for shadow stack double unmap
x86/shstk: Remove useless clone error handling
x86/shstk: Handle vfork clone failure correctly
x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
x86/srso: Don't probe microcode in a guest
x86/srso: Set CPUID feature bits independently of bug or mitigation status
x86/srso: Fix srso_show_state() side effect
x86/asm: Fix build of UML with KASAN
x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
Commit
7825451fa4dc ("static_call: Add call depth tracking support")
failed to realize the problem fixed there is not specific to call depth
tracking but applies to all return-thunk uses.
Move the fix to the appropriate place and condition.
Fixes: ee88d363d156 ("x86,static_call: Use alternative RET encoding")
Reported-by: David Kaplan <David.Kaplan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
The following commit
095b8303f383 ("x86/alternative: Make custom return thunk unconditional")
made '__x86_return_thunk' a placeholder value. All code setting
X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'. So
the optimization at the beginning of apply_returns() is dead code.
Also, before the above-mentioned commit, the optimization actually had a
bug It bypassed __static_call_fixup(), causing some raw returns to
remain unpatched in static call trampolines. Thus the 'Fixes' tag.
Fixes: d2408e043e72 ("x86/alternative: Optimize returns patching")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org
There are several ways a thread's shadow stacks can get unmapped. This
can happen on exit or exec, as well as error handling in exec or clone.
The task struct already keeps track of the thread's shadow stack. Use the
size variable to keep track of if the shadow stack has already been freed.
When an attempt to double unmap the thread shadow stack is caught, warn
about it and abort the operation.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-4-rick.p.edgecombe%40intel.com
When clone fails after the shadow stack is allocated, any allocated shadow
stack is cleaned up in exit_thread() in copy_process(). So the logic in
copy_thread() is unneeded, and also will not handle failures that happen
outside of copy_thread().
In addition, since there is a second attempt to unmap the same shadow
stack, there is a race where an newly mapped region could get unmapped.
So remove the logic in copy_thread() and rely on exit_thread() to handle
clone failure.
Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-3-rick.p.edgecombe%40intel.com
Shadow stacks are allocated automatically and freed on exit, depending
on the clone flags. The two cases where new shadow stacks are not
allocated are !CLONE_VM (fork()) and CLONE_VFORK (vfork()). For
!CLONE_VM, although a new stack is not allocated, it can be freed normally
because it will happen in the child's copy of the VM.
However, for CLONE_VFORK the parent and the child are actually using the
same shadow stack. So the kernel doesn't need to allocate *or* free a
shadow stack for a CLONE_VFORK child. CLONE_VFORK children already need
special tracking to avoid returning to userspace until the child exits or
execs. Shadow stack uses this same tracking to avoid freeing CLONE_VFORK
shadow stacks.
However, the tracking is not setup until the clone has succeeded
(internally). Which means, if a CLONE_VFORK fails, the existing logic will
not know it is a CLONE_VFORK and proceed to unmap the parents shadow stack.
This error handling cleanup logic runs via exit_thread() in the
bad_fork_cleanup_thread label in copy_process(). The issue was seen in
the glibc test "posix/tst-spawn3-pidfd" while running with shadow stack
using currently out-of-tree glibc patches.
Fix it by not unmapping the vfork shadow stack in the error case as well.
Since clone is implemented in core code, it is not ideal to pass the clone
flags along the error path in order to have shadow stack code have
symmetric logic in the freeing half of the thread shadow stack handling.
Instead use the existing state for thread shadow stacks to track whether
the thread is managing its own shadow stack. For CLONE_VFORK, simply set
shstk->base and shstk->size to 0, and have it mean the thread is not
managing a shadow stack and so should skip cleanup work. Implement this
by breaking up the CLONE_VFORK and !CLONE_VM cases in
shstk_alloc_thread_stack() to separate conditionals since, the logic is
now different between them. In the case of CLONE_VFORK && !CLONE_VM, the
existing behavior is to not clean up the shadow stack in the child (which
should go away quickly with either be exit or exec), so maintain that
behavior by handling the CLONE_VFORK case first in the allocation path.
This new logioc cleanly handles the case of normal, successful
CLONE_VFORK's skipping cleaning up their shadow stack's on exit as well.
So remove the existing, vfork shadow stack freeing logic. This is in
deactivate_mm() where vfork_done is used to tell if it is a vfork child
that can skip cleaning up the thread shadow stack.
Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-2-rick.p.edgecombe%40intel.com
If the user has requested no SRSO mitigation, other mitigations can use
the lighter-weight SBPB instead of IBPB.
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b20820c3cfd1003171135ec8d762a0b957348497.1693889988.git.jpoimboe@kernel.org
To support live migration, the hypervisor sets the "lowest common
denominator" of features. Probing the microcode isn't allowed because
any detected features might go away after a migration.
As Andy Cooper states:
"Linux must not probe microcode when virtualised. What it may see
instantaneously on boot (owing to MSR_PRED_CMD being fully passed
through) is not accurate for the lifetime of the VM."
Rely on the hypervisor to set the needed IBPB_BRTYPE and SBPB bits.
Fixes: 1b5277c0ea0b ("x86/srso: Add SRSO_NO support")
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/3938a7209606c045a3f50305d201d840e8c834c7.1693889988.git.jpoimboe@kernel.org
Booting with mitigations=off incorrectly prevents the
X86_FEATURE_{IBPB_BRTYPE,SBPB} CPUID bits from getting set.
Also, future CPUs without X86_BUG_SRSO might still have IBPB with branch
type prediction flushing, in which case SBPB should be used instead of
IBPB. The current code doesn't allow for that.
Also, cpu_has_ibpb_brtype_microcode() has some surprising side effects
and the setting of these feature bits really doesn't belong in the
mitigation code anyway. Move it to earlier.
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/869a1709abfe13b673bdd10c2f4332ca253a40bc.1693889988.git.jpoimboe@kernel.org
Reading the 'spec_rstack_overflow' sysfs file can trigger an unnecessary
MSR write, and possibly even a (handled) exception if the microcode
hasn't been updated.
Avoid all that by just checking X86_FEATURE_IBPB_BRTYPE instead, which
gets set by srso_select_mitigation() if the updated microcode exists.
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/27d128899cb8aee9eb2b57ddc996742b0c1d776b.1693889988.git.jpoimboe@kernel.org
Only Xen is using the paravirt lazy mode code, so it can be moved to
Xen specific sources.
This allows to make some of the functions static or to merge them into
their only call sites.
While at it do a rename from "paravirt" to "xen" for all moved
specifiers.
No functional change.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230913113828.18421-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
The code calling ima_free_kexec_buffer() runs long after the memblock
allocator has already been torn down, potentially resulting in a use
after free in memblock_isolate_range().
With KASAN or KFENCE, this use after free will result in a BUG
from the idle task, and a subsequent kernel panic.
Switch ima_free_kexec_buffer() over to memblock_free_late() to avoid
that bug.
Fixes: fee3ff99bc67 ("powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c")
Suggested-by: Mike Rappoport <rppt@kernel.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230817135558.67274c83@imladris.surriel.com
- Fix an UV boot crash,
- Skip spurious ENDBR generation on _THIS_IP_,
- Fix ENDBR use in putuser() asm methods,
- Fix corner case boot crashes on 5-level paging,
- and fix a false positive WARNING on LTO kernels.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOrwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j6Jw/+PjUfh/4+KM/Z8VzcBy2UhY3NMF2ptGCN
47FPLy+8ADqOvIfgsPsBEO9VXwdvkHfH64YWRUlULjvPNOSs+37drBYMe9AI9xKE
u6NhoBHmsnOtoLkBFIQYlJys9GyAeoMlwSSHxzRwQS+3VztRjoH636jiBcg/h7DR
zhakfnJD1SSOZuEyyDPnO0uIUarrcqC2jdZwucSqZnvZFdA/pexUHQEe2RtMXLln
EIA5kuEo7UdADcSr/Lbty7MKO+6xpRTjxF0yNItPtwPWsnxrSAC7P+dQ37uB975U
Z0CJ/kw54XG5Sdoech7XCWYmJzDxSPhiziA1USKpZJMo5phAG/apTJK6NG4TG94r
U+3QhLopUoSd8N74/VtZn0FjOrMsk7YKD5o8twlTbpCd2VaiJk4X5oBIns6Tvq05
0Vmsx15XO3mEzg1wWbbdjhjHW0czRgBpikS9QyexZKVkukYcW8QT6bk1gK1ypg94
do4PHKB53QIt31dedy/ddpQj4u1sJ4+a9/+IG29kjUB33M4+uFedtw11vfe+CDN0
XLRc6HfPblogIIEO/figJgwSq/TPCOsNHTwKkejq+D1Ey6DsrnX9Gg4BWVz/3dDW
84SW7TaO2FGEDh4NkR2ijkZlbpofFnCvhCh/uohodPlqDrTVTuRKCZW9I5plmGVa
qeXUpNDFs1E=
=BMjT
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- Fix an UV boot crash
- Skip spurious ENDBR generation on _THIS_IP_
- Fix ENDBR use in putuser() asm methods
- Fix corner case boot crashes on 5-level paging
- and fix a false positive WARNING on LTO kernels"
* tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/purgatory: Remove LTO flags
x86/boot/compressed: Reserve more memory for page tables
x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*()
x86/ibt: Suppress spurious ENDBR
x86/platform/uv: Use alternate source for socket to node data
balancing bug, and a topology setup bug on (Intel) hybrid processors.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOVQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iOahAAj3YsoNbT/k6m9yp622n1OopaNEQvsK+/
F2Q5g/hJrm3+W5764rF8CvjhDbmrv6owjp3yUyZLDIfSAFZYMvwoNody3a373Yr3
VFBMJ00jNIv/TAFCJZYeybg3yViwObKKfpu4JBj//QU+4uGWCoBMolkVekU2bBti
r50fMxBPgg2Yic57DCC8Y+JZzHI/ydQ3rvVXMzkrTZCO/zY4/YmERM9d+vp4wl4B
uG9cfXQ4Yf/1gZo0XDlTUkOJUXPnkMgi+N4eHYlGuyOCoIZOfATI24hRaPBoQcdx
PDwHcKmyNxH9iaRppNQMvi797g3KrKVEmZwlZg1IfsILhKC0F4GsQ85tw8qQWE8j
brFPkWVUxAUSl4LXoqVInaxDHmJWR2UC3RA7b+BxFF/GMLTow0z4a+JMC6eKlNyR
uBisZnuEuecqwF9TLhyD3KBHh7PihUPz8PuFHk+Um5sggaUE82I+VwX6uxEi5y8r
ke2kAkpuMxPWT5lwDmFPAXWfvpZz5vvTIRUxGGj2+s4d8b0YfLtZsx5+uOIacaub
Gw+wYFfSowph72tR/SUVq0An/UTSPPBxty8eFIVeE6sW9bw3ghTtkf8300xjV7Rj
sKVxXy/podAo8wG7R6aZfTfsCpohmeEjskiatYdThYamPPx7V0R5pq4twmTXTHLJ
bFvQ1GFCOu0=
=jIeN
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a performance regression on large SMT systems, an Intel SMT4
balancing bug, and a topology setup bug on (Intel) hybrid processors"
* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
sched/fair: Fix SMT4 group_smt_balance handling
sched/fair: Optimize should_we_balance() for large SMT systems