2019-05-27 09:55:01 +03:00
// SPDX-License-Identifier: GPL-2.0-or-later
2005-09-26 10:04:21 +04:00
/*
 * This program is used to generate definitions needed by
 * assembly language modules.
 *
 * We use the technique used in the OSF Mach kernel code:
 * generate asm statements containing #defines,
 * compile this file to assembler, and then extract the
 * #defines from the assembly-language output.
 */
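The header comment above describes turning structure offsets into #define lines for assembly code. A minimal userspace sketch of that idea follows. It swaps in plain string formatting for the asm-statement extraction the comment describes, and `struct demo_thread`, `format_define()` and `OFFSET_LINE()` are hypothetical names that exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a kernel structure; not the real layout. */
struct demo_thread {
	unsigned long ksp;	/* saved kernel stack pointer */
	unsigned long regs;	/* stand-in for a pt_regs pointer */
};

/*
 * Format one "#define SYM value" line into buf.
 * Returns the character count reported by snprintf().
 */
static int format_define(char *buf, size_t len, const char *sym, size_t val)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "#define %s %zu", sym, val);
}

/* Userspace analogue of OFFSET(): stringify the symbol, compute offsetof(). */
#define OFFSET_LINE(buf, len, sym, type, member) \
	format_define((buf), (len), #sym, offsetof(struct type, member))
```

Calling `OFFSET_LINE(line, sizeof(line), DEMO_KSP, demo_thread, ksp)` fills `line` with a definition that an assembly source could include, which is the role the generated header plays for the real OFFSET() entries in this file.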
2019-01-31 13:08:58 +03:00
#define GENERATING_ASM_OFFSETS	/* asm/smp.h */
2018-03-14 07:03:25 +03:00
#include <linux/compat.h>
2005-09-26 10:04:21 +04:00
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/mman.h>
#include <linux/mm.h>
2007-05-03 16:31:38 +04:00
#include <linux/suspend.h>
2008-02-05 08:16:48 +03:00
#include <linux/hrtimer.h>
2005-09-28 18:35:31 +04:00
#ifdef CONFIG_PPC64
2005-09-26 10:04:21 +04:00
#include <linux/time.h>
#include <linux/hardirq.h>
2005-09-28 18:35:31 +04:00
#endif
2008-04-29 12:04:08 +04:00
#include <linux/kbuild.h>
2005-09-28 18:35:31 +04:00
2005-09-26 10:04:21 +04:00
#include <asm/io.h>
#include <asm/page.h>
#include <asm/processor.h>
#include <asm/cputable.h>
#include <asm/thread_info.h>
2005-10-26 11:05:24 +04:00
#include <asm/rtas.h>
2005-11-11 13:15:21 +03:00
#include <asm/vdso_datapage.h>
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 06:21:12 +03:00
#include <asm/dbell.h>
2005-09-26 10:04:21 +04:00
#ifdef CONFIG_PPC64
#include <asm/paca.h>
#include <asm/lppaca.h>
#include <asm/cache.h>
2006-08-09 11:00:30 +04:00
#include <asm/mmu.h>
2006-09-13 22:32:39 +04:00
#include <asm/hvcall.h>
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
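The H_CEDE rule described in this commit message (exit to the kernel only when every other vcpu in the core is also idle, otherwise nap) can be sketched as a bitmask check. This is an illustrative sketch only: `demo_cede_must_exit()`, the idle bitmask encoding and the limit of 8 threads are assumptions made here, not the data structures used by book3s_hv_rmhandlers.S.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide what a vcpu does on H_CEDE. Bit n of idle_mask is set when
 * thread n of the core is idle (hypothetical encoding). Returns true
 * when all *other* threads are idle, so the vcpu must exit to the
 * kernel, and false when it can simply nap.
 */
static bool demo_cede_must_exit(unsigned int idle_mask,
				unsigned int nthreads, unsigned int self)
{
	assert(nthreads > 0 && nthreads <= 8 && self < nthreads);

	unsigned int all = (1u << nthreads) - 1u;
	unsigned int others = all & ~(1u << self);

	return (idle_mask & others) == others;
}
```

With four threads per core, as in the SMT4 case above, thread 0 exits only when bits 1 to 3 are all set.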
2011-07-23 11:42:46 +04:00
#include <asm/xics.h>
2005-09-26 10:04:21 +04:00
#endif
2011-09-19 21:45:04 +04:00
#ifdef CONFIG_PPC_POWERNV
#include <asm/opal.h>
#endif
2010-08-30 14:01:56 +04:00
#if defined(CONFIG_KVM) || defined(CONFIG_KVM_GUEST)
2009-01-04 01:23:08 +03:00
#include <linux/kvm_host.h>
2010-04-16 02:11:44 +04:00
#endif
2010-08-30 14:01:56 +04:00
#if defined(CONFIG_KVM) && defined(CONFIG_PPC_BOOK3S)
#include <asm/kvm_book3s.h>
2014-04-24 15:46:24 +04:00
#include <asm/kvm_ppc.h>
2008-11-05 18:36:18 +03:00
#endif
2005-09-26 10:04:21 +04:00
2009-07-28 05:59:34 +04:00
#ifdef CONFIG_PPC32
2008-04-30 14:23:21 +04:00
#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
#include "head_booke.h"
#endif
2009-07-28 05:59:34 +04:00
#endif
2008-04-30 14:23:21 +04:00
2009-10-17 03:48:40 +04:00
#if defined(CONFIG_PPC_FSL_BOOK3E)
2008-12-09 06:34:55 +03:00
#include "../mm/mmu_decl.h"
#endif
powerpc/8xx: Fix vaddr for IMMR early remap
Memory: 124428K/131072K available (3748K kernel code, 188K rwdata,
648K rodata, 508K init, 290K bss, 6644K reserved)
Kernel virtual memory layout:
* 0xfffdf000..0xfffff000 : fixmap
* 0xfde00000..0xfe000000 : consistent mem
* 0xfddf6000..0xfde00000 : early ioremap
* 0xc9000000..0xfddf6000 : vmalloc & ioremap
SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Today, IMMR is mapped 1:1 at startup.
Mapping IMMR 1:1 is just wrong because it may overlap with another
area. On most mpc8xx boards it is OK, as IMMR is set to 0xff000000,
but on the EP88xC board, for instance, IMMR is at 0xfa200000, which
overlaps with the VM ioremap area.
This patch fixes the virtual address for remapping IMMR with the fixmap,
regardless of the value of IMMR.
The size of the IMMR area is 256 kbytes (CPM at offset 0, security engine
at offset 128k), so a 512k page is enough.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-05-17 10:02:43 +03:00
#ifdef CONFIG_PPC_8xx
#include <asm/fixmap.h>
#endif
2020-05-06 06:40:23 +03:00
#ifdef CONFIG_XMON
#include "../xmon/xmon_bpts.h"
#endif
2016-06-02 07:29:47 +03:00
#define STACK_PT_REGS_OFFSET(sym, val)	\
	DEFINE(sym, STACK_FRAME_OVERHEAD + offsetof(struct pt_regs, val))
2005-09-26 10:04:21 +04:00
int main(void)
{
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD, task_struct, thread);
	OFFSET(MM, task_struct, mm);
2018-09-27 10:05:53 +03:00
#ifdef CONFIG_STACKPROTECTOR
	OFFSET(TASK_CANARY, task_struct, stack_canary);
2018-09-27 10:05:55 +03:00
#ifdef CONFIG_PPC64
	OFFSET(PACA_CANARY, paca_struct, canary);
#endif
2018-09-27 10:05:53 +03:00
#endif
powerpc/asm-offset: Remove unused items
Following PACA related items are not used anymore by ASM code:
PACA_SIZE, PACACONTEXTID, PACALOWSLICESPSIZE, PACAHIGHSLICEPSIZE,
PACA_SLB_ADDR_LIMIT, MMUPSIZEDEFSIZE, PACASLBCACHE, PACASLBCACHEPTR,
PACASTABRR, PACAVMALLOCSLLP, MMUPSIZESLLP, PACACONTEXTSLLP,
PACALPPACAPTR, LPPACA_DTLIDX and PACA_DTL_RIDX.
Following items are also not used anymore:
SIGSEGV, NMI_MASK, THREAD_DBCR0, KUAP, TI_FLAGS, TI_PREEMPT,
DCACHEL1BLOCKSPERPAGE, ICACHEL1BLOCKSIZE, ICACHEL1LOGBLOCKSIZE,
ICACHEL1BLOCKSPERPAGE, STACK_REGS_KUAP, KVM_NEED_FLUSH, KVM_FWNMI,
VCPU_DEC, VCPU_SPMC, HSTATE_XICS_PHYS, HSTATE_SAVED_XIRR and
PPC_DBELL_MSGTYPE.
Remove all of them.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1c80981548dc0c4f145109cdd473022c1aad8d2b.1620223302.git.christophe.leroy@csgroup.eu
2021-05-05 17:02:12 +03:00
#ifdef CONFIG_PPC32
2019-02-21 13:37:54 +03:00
#ifdef CONFIG_PPC_RTAS
	OFFSET(RTAS_SP, thread_struct, rtas_sp);
#endif
2005-09-28 18:35:31 +04:00
#endif /* CONFIG_PPC32 */
2019-01-31 13:08:54 +03:00
	OFFSET(TASK_STACK, task_struct, stack);
2019-01-31 13:08:58 +03:00
#ifdef CONFIG_SMP
2019-01-31 13:09:04 +03:00
	OFFSET(TASK_CPU, task_struct, cpu);
2019-01-31 13:08:58 +03:00
#endif
2005-09-28 18:35:31 +04:00
powerpc/livepatch: Add live patching support on ppc64le
Add the kconfig logic & assembly support for handling live patched
functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn
depends on the new -mprofile-kernel ftrace ABI, which is only supported
currently on ppc64le.
Live patching is handled by a special ftrace handler. This means it runs
from ftrace_caller(). The live patch handler modifies the NIP so as to
redirect the return from ftrace_caller() to the new patched function.
However there is one particularly tricky case we need to handle.
If a function A calls another function B, and it is known at link time
that they share the same TOC, then A will not save or restore its TOC,
and will call the local entry point of B.
When we live patch B, we replace it with a new function C, which may
not have the same TOC as A. At live patch time it's too late to modify A
to do the TOC save/restore, so the live patching code must interpose
itself between A and C, and do the TOC save/restore that A omitted.
An additional complication is that the livepatch code cannot create a
stack frame in order to save the TOC. That is because if C takes > 8
arguments, or is varargs, A will have written the arguments for C in
A's stack frame.
To solve this, we introduce a "livepatch stack" which grows upward from
the base of the regular stack, and is used to store the TOC & LR when
calling a live patched function.
When the patched function returns, we retrieve the real LR & TOC from
the livepatch stack, restore them, and pop the livepatch "stack frame".
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
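The "livepatch stack" described in this commit message grows upward from the base of the regular stack and holds the TOC and LR while a patched function runs. A minimal sketch of that push/pop discipline follows. The names (`struct demo_task`, `demo_livepatch_push()`, `demo_livepatch_pop()`), the stack size and the two-word frame are assumptions for illustration; the frame layout used by the real assembly may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size, chosen only for this sketch. */
#define DEMO_STACK_WORDS 64

struct demo_task {
	unsigned long stack[DEMO_STACK_WORDS];	/* regular stack would grow down from the top */
	unsigned long *livepatch_sp;		/* livepatch stack grows up from the base */
};

static void demo_livepatch_init(struct demo_task *t)
{
	t->livepatch_sp = &t->stack[0];
}

/* Save the caller's TOC and LR before entering the patched function. */
static void demo_livepatch_push(struct demo_task *t,
				unsigned long toc, unsigned long lr)
{
	assert(t->livepatch_sp + 2 <= &t->stack[DEMO_STACK_WORDS]);
	t->livepatch_sp[0] = toc;
	t->livepatch_sp[1] = lr;
	t->livepatch_sp += 2;
}

/* Pop the most recent frame and hand back the saved TOC and LR. */
static void demo_livepatch_pop(struct demo_task *t,
			       unsigned long *toc, unsigned long *lr)
{
	assert(t->livepatch_sp >= &t->stack[2]);
	t->livepatch_sp -= 2;
	*toc = t->livepatch_sp[0];
	*lr = t->livepatch_sp[1];
}
```

Keeping the saved values on a separate upward-growing region means no new frame is pushed onto the regular stack, so arguments that the caller wrote into its own stack frame stay where the patched function expects them.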
2016-03-24 14:04:05 +03:00
#ifdef CONFIG_LIVEPATCH
2017-02-15 13:41:20 +03:00
	OFFSET(TI_livepatch_sp, thread_info, livepatch_sp);
2016-03-24 14:04:05 +03:00
#endif
2017-02-15 13:41:20 +03:00
	OFFSET(KSP, thread_struct, ksp);
	OFFSET(PT_REGS, thread_struct, regs);
2011-04-23 01:48:27 +04:00
#ifdef CONFIG_BOOKE
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_NORMSAVES, thread_struct, normsave[0]);
2011-04-23 01:48:27 +04:00
#endif
2020-08-18 20:19:17 +03:00
#ifdef CONFIG_PPC_FPU
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_FPEXC_MODE, thread_struct, fpexc_mode);
2017-05-08 09:23:31 +03:00
	OFFSET(THREAD_FPSTATE, thread_struct, fp_state.fpr);
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_FPSAVEAREA, thread_struct, fp_save_area);
2020-08-18 20:19:17 +03:00
#endif
2017-02-15 13:41:20 +03:00
	OFFSET(FPSTATE_FPSCR, thread_fp_state, fpscr);
	OFFSET(THREAD_LOAD_FP, thread_struct, load_fp);
2005-09-26 10:04:21 +04:00
#ifdef CONFIG_ALTIVEC
2017-05-08 09:23:31 +03:00
	OFFSET(THREAD_VRSTATE, thread_struct, vr_state.vr);
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_VRSAVEAREA, thread_struct, vr_save_area);
	OFFSET(THREAD_USED_VR, thread_struct, used_vr);
	OFFSET(VRSTATE_VSCR, thread_vr_state, vscr);
	OFFSET(THREAD_LOAD_VEC, thread_struct, load_vec);
2005-09-26 10:04:21 +04:00
#endif /* CONFIG_ALTIVEC */
2008-06-25 08:07:18 +04:00
#ifdef CONFIG_VSX
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_USED_VSR, thread_struct, used_vsr);
2008-06-25 08:07:18 +04:00
#endif /* CONFIG_VSX */
2005-09-28 18:35:31 +04:00
#ifdef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
	OFFSET(KSP_VSID, thread_struct, ksp_vsid);
2005-09-28 18:35:31 +04:00
#else /* CONFIG_PPC64 */
2017-02-15 13:41:20 +03:00
	OFFSET(PGDIR, thread_struct, pgdir);
2019-12-21 11:32:27 +03:00
	OFFSET(SRR0, thread_struct, srr0);
	OFFSET(SRR1, thread_struct, srr1);
	OFFSET(DAR, thread_struct, dar);
	OFFSET(DSISR, thread_struct, dsisr);
powerpc/32s: Fix DSI and ISI exceptions for CONFIG_VMAP_STACK
hash_page() needs to read page tables from kernel memory. When entire
kernel memory is mapped by BATs, which is normally the case when
CONFIG_STRICT_KERNEL_RWX is not set, it works even if the page hosting
the page table is not referenced in the MMU hash table.
However, if the page where the page table resides is not covered by
a BAT, a DSI fault can be encountered from hash_page(), and it loops
forever. This can happen when CONFIG_STRICT_KERNEL_RWX is selected
and the alignment of the different regions is too small to allow
covering the entire memory with BATs. This also happens when
CONFIG_DEBUG_PAGEALLOC is selected or when booting with 'nobats'
flag.
Also, if the page containing the kernel stack is not present in the
MMU hash table, registers cannot be saved and a recursive DSI fault
is encountered.
To allow hash_page() to properly do its job at all times and load the
MMU hash table whenever needed, it must run with data MMU disabled.
This means it must be called before re-enabling data MMU. To allow
this, registers clobbered by hash_page() and create_hpte() have to
be saved in the thread struct together with SRR0, SRR1, DAR and DSISR.
It is also necessary to ensure that DSI prolog doesn't overwrite
regs saved by prolog of the current running exception. That means:
- DSI can only use SPRN_SPRG_SCRATCH0
- Exceptions must free SPRN_SPRG_SCRATCH0 before writing to the stack.
This also fixes the Oops reported by Erhard when create_hpte() is
called by add_hash_page().
Due to prolog size increase, a few more exceptions had to get split
in two parts.
Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Erhard F. <erhard_f@mailbox.org>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206501
Link: https://lore.kernel.org/r/64a4aa44686e9fd4b01333401367029771d9b231.1581761633.git.christophe.leroy@c-s.fr
2020-02-15 13:14:25 +03:00
#ifdef CONFIG_PPC_BOOK3S_32
	OFFSET(THR0, thread_struct, r0);
	OFFSET(THR3, thread_struct, r3);
	OFFSET(THR4, thread_struct, r4);
	OFFSET(THR5, thread_struct, r5);
	OFFSET(THR6, thread_struct, r6);
	OFFSET(THR8, thread_struct, r8);
	OFFSET(THR9, thread_struct, r9);
	OFFSET(THR11, thread_struct, r11);
	OFFSET(THLR, thread_struct, lr);
	OFFSET(THCTR, thread_struct, ctr);
#endif
2005-09-26 10:04:21 +04:00
#ifdef CONFIG_SPE
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_EVR0, thread_struct, evr[0]);
	OFFSET(THREAD_ACC, thread_struct, acc);
	OFFSET(THREAD_USED_SPE, thread_struct, used_spe);
2005-09-26 10:04:21 +04:00
#endif /* CONFIG_SPE */
2005-09-28 18:35:31 +04:00
#endif /* CONFIG_PPC64 */
2010-04-16 02:11:51 +04:00
#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_KVM_SVCPU, thread_struct, kvm_shadow_vcpu);
2010-04-16 02:11:51 +04:00
#endif
2013-01-16 02:20:42 +04:00
#if defined(CONFIG_KVM) && defined(CONFIG_BOOKE)
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_KVM_VCPU, thread_struct, kvm_vcpu);
2011-12-20 19:34:43 +04:00
#endif
2005-09-28 18:35:31 +04:00
2013-02-13 20:21:32 +04:00
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2017-02-15 13:41:20 +03:00
	OFFSET(PACATMSCRATCH, paca_struct, tm_scratch);
	OFFSET(THREAD_TM_TFHAR, thread_struct, tm_tfhar);
	OFFSET(THREAD_TM_TEXASR, thread_struct, tm_texasr);
	OFFSET(THREAD_TM_TFIAR, thread_struct, tm_tfiar);
	OFFSET(THREAD_TM_TAR, thread_struct, tm_tar);
	OFFSET(THREAD_TM_PPR, thread_struct, tm_ppr);
	OFFSET(THREAD_TM_DSCR, thread_struct, tm_dscr);
2020-09-19 18:00:25 +03:00
	OFFSET(THREAD_TM_AMR, thread_struct, tm_amr);
2017-02-15 13:41:20 +03:00
	OFFSET(PT_CKPT_REGS, thread_struct, ckpt_regs);
2017-05-08 09:23:31 +03:00
	OFFSET(THREAD_CKVRSTATE, thread_struct, ckvr_state.vr);
2017-02-15 13:41:20 +03:00
	OFFSET(THREAD_CKVRSAVE, thread_struct, ckvrsave);
2017-05-08 09:23:31 +03:00
	OFFSET(THREAD_CKFPSTATE, thread_struct, ckfp_state.fpr);
2013-02-13 20:21:32 +04:00
	/* Local pt_regs on stack for Transactional Memory funcs. */
	DEFINE(TM_FRAME_SIZE, STACK_FRAME_OVERHEAD +
	       sizeof(struct pt_regs) + 16);
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
2013-02-07 19:46:58 +04:00
2017-02-15 13:41:20 +03:00
	OFFSET(TI_LOCAL_FLAGS, thread_info, local_flags);
2005-09-28 18:35:31 +04:00
#ifdef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
	OFFSET(DCACHEL1BLOCKSIZE, ppc64_caches, l1d.block_size);
	OFFSET(DCACHEL1LOGBLOCKSIZE, ppc64_caches, l1d.log_block_size);
2005-09-28 18:35:31 +04:00
	/* paca */
2017-02-15 13:41:20 +03:00
	OFFSET(PACAPACAINDEX, paca_struct, paca_index);
	OFFSET(PACAPROCSTART, paca_struct, cpu_start);
	OFFSET(PACAKSAVE, paca_struct, kstack);
	OFFSET(PACACURRENT, paca_struct, __current);
2019-01-12 12:55:50 +03:00
	DEFINE(PACA_THREAD_INFO, offsetof(struct paca_struct, __current) +
	       offsetof(struct task_struct, thread_info));
2017-02-15 13:41:20 +03:00
	OFFSET(PACASAVEDMSR, paca_struct, saved_msr);
	OFFSET(PACAR1, paca_struct, saved_r1);
	OFFSET(PACATOC, paca_struct, kernel_toc);
	OFFSET(PACAKBASE, paca_struct, kernelbase);
	OFFSET(PACAKMSR, paca_struct, kernel_msr);
2017-12-20 06:55:50 +03:00
	OFFSET(PACAIRQSOFTMASK, paca_struct, irq_soft_mask);
2017-02-15 13:41:20 +03:00
	OFFSET(PACAIRQHAPPENED, paca_struct, irq_happened);
2018-04-19 10:04:00 +03:00
	OFFSET(PACA_FTRACE_ENABLED, paca_struct, ftrace_enabled);
2009-07-24 03:15:42 +04:00
#ifdef CONFIG_PPC_BOOK3E
2017-02-15 13:41:20 +03:00
	OFFSET(PACAPGD, paca_struct, pgd);
	OFFSET(PACA_KERNELPGD, paca_struct, kernel_pgd);
	OFFSET(PACA_EXGEN, paca_struct, exgen);
	OFFSET(PACA_EXTLB, paca_struct, extlb);
	OFFSET(PACA_EXMC, paca_struct, exmc);
	OFFSET(PACA_EXCRIT, paca_struct, excrit);
	OFFSET(PACA_EXDBG, paca_struct, exdbg);
	OFFSET(PACA_MC_STACK, paca_struct, mc_kstack);
	OFFSET(PACA_CRIT_STACK, paca_struct, crit_kstack);
	OFFSET(PACA_DBG_STACK, paca_struct, dbg_kstack);
	OFFSET(PACA_TCD_PTR, paca_struct, tcd_ptr);
	OFFSET(TCD_ESEL_NEXT, tlb_core_data, esel_next);
	OFFSET(TCD_ESEL_MAX, tlb_core_data, esel_max);
	OFFSET(TCD_ESEL_FIRST, tlb_core_data, esel_first);
2009-07-24 03:15:42 +04:00
#endif /* CONFIG_PPC_BOOK3E */
2017-10-19 07:08:43 +03:00
#ifdef CONFIG_PPC_BOOK3S_64
2017-02-15 13:41:20 +03:00
	OFFSET(PACA_EXGEN, paca_struct, exgen);
	OFFSET(PACA_EXMC, paca_struct, exmc);
2016-12-19 21:30:04 +03:00
	OFFSET(PACA_EXNMI, paca_struct, exnmi);
2017-02-15 13:41:20 +03:00
	OFFSET(PACA_SLBSHADOWPTR, paca_struct, slb_shadow_ptr);
	OFFSET(SLBSHADOW_STACKVSID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid);
	OFFSET(SLBSHADOW_STACKESID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].esid);
	OFFSET(SLBSHADOW_SAVEAREA, slb_shadow, save_area);
	OFFSET(LPPACA_PMCINUSE, lppaca, pmcregs_in_use);
2018-02-13 18:08:11 +03:00
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
	OFFSET(PACA_PMCINUSE, paca_struct, pmcregs_in_use);
#endif
2017-02-15 13:41:20 +03:00
	OFFSET(LPPACA_YIELDCOUNT, lppaca, yield_count);
2017-10-19 07:08:43 +03:00
#endif /* CONFIG_PPC_BOOK3S_64 */
2017-02-15 13:41:20 +03:00
	OFFSET(PACAEMERGSP, paca_struct, emergency_sp);
powerpc/book3s: handle machine check in Linux host.
Move machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and handover the high level info to OS.
This patch introduces early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
stack frame on emergency stack and set the r1 accordingly. This allows us to be
prepared to take another exception without losing context. One thing to note
here is that, if we get another machine check while the ME bit is off, we risk a
checkstop. Hence we restrict ourselves to saving only the MCE information and the
registers saved in the PACA_EXMC save area before we turn the ME bit on. We use
paca->in_mce flag to differentiate between first entry and nested machine check
entry which helps proper use of emergency stack. We increment paca->in_mce
every time we enter in early machine check handler and decrement it while
leaving. When we enter machine check early handler first time (paca->in_mce ==
0), we are sure nobody is using MC emergency stack and allocate a stack frame
at the start of the emergency stack. During subsequent entry (paca->in_mce >
0), we know that r1 points inside emergency stack and we allocate separate
stack frame accordingly. This prevents us from clobbering MCE information
during nested machine checks.
The early machine check handler changes are placed under CPU_FTR_HVMODE
section. This makes sure that the early machine check handler will get executed
only in hypervisor kernel.
This is the code flow:
Machine Check Interrupt
        |
        V
   0x200 vector                                   ME=0, IR=0, DR=0
        |
        V
+-----------------------------------------------+
|machine_check_pSeries_early:                   | ME=0, IR=0, DR=0
|       Alloc frame on emergency stack          |
|       Save srr1, srr0, dar and dsisr on stack |
+-----------------------------------------------+
        |
(ME=1, IR=0, DR=0, RFID)
        |
        V
machine_check_handle_early                        ME=1, IR=0, DR=0
        |
        V
+-----------------------------------------------+
|       machine_check_early (r3=pt_regs)        | ME=1, IR=0, DR=0
|       Things to do: (in next patches)         |
|               Flush SLB for SLB errors        |
|               Flush TLB for TLB errors        |
|               Decode and save MCE info        |
+-----------------------------------------------+
        |
(Fall through existing exception handler routine.)
        |
        V
machine_check_pSerie                              ME=1, IR=0, DR=0
        |
(ME=1, IR=1, DR=1, RFID)
        |
        V
machine_check_common                              ME=1, IR=1, DR=1
        .
        .
        .
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
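The nesting rule in this commit message (first entry allocates at the start of the emergency stack, nested entries allocate below the current r1, and in_mce counts the depth) can be sketched in C. This is a sketch under stated assumptions: `struct demo_paca`, `demo_mce_enter()`, `demo_mce_exit()` and the frame size are hypothetical, and the real handler is written in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical frame size, chosen only for this sketch. */
#define DEMO_FRAME_SIZE 256u

struct demo_paca {
	uintptr_t mc_emergency_sp;	/* top of the machine check emergency stack */
	unsigned int in_mce;		/* machine check nesting depth */
};

/*
 * Pick the stack frame for a machine check entry and return the new r1.
 * First entry (in_mce == 0): start from the top of the emergency stack.
 * Nested entry (in_mce > 0): r1 already points into the emergency stack,
 * so allocate the next frame below it.
 */
static uintptr_t demo_mce_enter(struct demo_paca *paca, uintptr_t r1)
{
	uintptr_t base = (paca->in_mce == 0) ? paca->mc_emergency_sp : r1;

	paca->in_mce++;
	return base - DEMO_FRAME_SIZE;
}

/* Leaving the early handler undoes one level of nesting. */
static void demo_mce_exit(struct demo_paca *paca)
{
	assert(paca->in_mce > 0);
	paca->in_mce--;
}
```

Because a nested entry never restarts from the top of the emergency stack, the MCE information saved by the outer entry is not overwritten.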
2013-10-30 18:34:08 +04:00
#ifdef CONFIG_PPC_BOOK3S_64
2017-02-15 13:41:20 +03:00
	OFFSET(PACAMCEMERGSP, paca_struct, mc_emergency_sp);
2016-12-19 21:30:06 +03:00
	OFFSET(PACA_NMI_EMERG_SP, paca_struct, nmi_emergency_sp);
2017-02-15 13:41:20 +03:00
	OFFSET(PACA_IN_MCE, paca_struct, in_mce);
2016-12-19 21:30:05 +03:00
	OFFSET(PACA_IN_NMI, paca_struct, in_nmi);
powerpc/64s: Add support for RFI flush of L1-D cache
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
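The software fallback mentioned in this commit message, a displacement flush, works by loading from a separate area so that previously cached lines are pushed out of the L1-D. A rough C sketch of the access pattern follows. The function name, the 128-byte line size and the one-load-per-line stride are assumptions for illustration; the real fallback is implemented in assembly and takes its area and size from the paca fields exported below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache line size, chosen only for this sketch. */
#define DEMO_LINE_SIZE 128u

/*
 * Touch one byte per cache line across the fallback area so that the
 * loads displace whatever was in the L1-D before. Returns the number
 * of lines touched.
 */
static size_t demo_displacement_flush(const volatile unsigned char *area,
				      size_t size)
{
	size_t lines = 0;
	unsigned char sink = 0;

	assert(area != NULL);
	for (size_t off = 0; off < size; off += DEMO_LINE_SIZE) {
		sink ^= area[off];	/* volatile load, cannot be optimised away */
		lines++;
	}
	(void)sink;
	return lines;
}
```

The `volatile` qualifier matters here: without it a compiler could drop loads whose results are never used, and no cache lines would be displaced.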
2018-01-09 19:07:15 +03:00
	OFFSET(PACA_RFI_FLUSH_FALLBACK_AREA, paca_struct, rfi_flush_fallback_area);
	OFFSET(PACA_EXRFI, paca_struct, exrfi);
2018-01-17 16:58:18 +03:00
	OFFSET(PACA_L1D_FLUSH_SIZE, paca_struct, l1d_flush_size);
2018-01-09 19:07:15 +03:00
2017-02-15 13:41:20 +03:00
#endif
	OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
	OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
	OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
powerpc/64s/exception: remove bad stack branch
The bad stack test in interrupt handlers has a few problems. For
performance it is taken in the common case, which is a fetch bubble
and a waste of i-cache.
For code development and maintenance, it requires yet another stack
frame setup routine, and that constrains all exception handlers to
follow the same register save pattern which inhibits future
optimisation.
Remove the test/branch and replace it with a trap. Teach the program
check handler to use the emergency stack for this case.
This does not result in quite so nice a message, however the SRR0 and
SRR1 of the crashed interrupt can be seen in r11 and r12, as is the
original r1 (adjusted by INT_FRAME_SIZE). These are the most important
parts to debugging the issue.
The original r9-12 and cr0 are lost, which is the main downside.
kernel BUG at linux/arch/powerpc/kernel/exceptions-64s.S:847!
Oops: Exception in kernel mode, sig: 5 [#1]
BE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted
NIP: c000000000009108 LR: c000000000cadbcc CTR: c0000000000090f0
REGS: c0000000fffcbd70 TRAP: 0700 Not tainted
MSR: 9000000000021032 <SF,HV,ME,IR,DR,RI> CR: 28222448 XER: 20040000
CFAR: c000000000009100 IRQMASK: 0
GPR00: 000000000000003d fffffffffffffd00 c0000000018cfb00 c0000000f02b3166
GPR04: fffffffffffffffd 0000000000000007 fffffffffffffffb 0000000000000030
GPR08: 0000000000000037 0000000028222448 0000000000000000 c000000000ca8de0
GPR12: 9000000002009032 c000000001ae0000 c000000000010a00 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: c0000000f00322c0 c000000000f85200 0000000000000004 ffffffffffffffff
GPR24: fffffffffffffffe 0000000000000000 0000000000000000 000000000000000a
GPR28: 0000000000000000 0000000000000000 c0000000f02b391c c0000000f02b3167
NIP [c000000000009108] decrementer_common+0x18/0x160
LR [c000000000cadbcc] .vsnprintf+0x3ec/0x4f0
Call Trace:
Instruction dump:
996d098a 994d098b 38610070 480246ed 48005518 60000000 38200000 718a4000
7c2a0b78 3821fd00 41c20008 e82d0970 <0981fd00> f92101a0 f9610170 f9810178
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-28 09:33:18 +03:00
# ifdef CONFIG_PPC_BOOK3E
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_TRAP_SAVE , paca_struct , trap_save ) ;
2019-06-28 09:33:18 +03:00
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_SPRG_VDSO , paca_struct , sprg_vdso ) ;
2016-05-17 09:33:46 +03:00
# else /* CONFIG_PPC64 */
2005-10-26 11:05:24 +04:00
# endif /* CONFIG_PPC64 */
2005-09-28 18:35:31 +04:00
/* RTAS */
2017-02-15 13:41:20 +03:00
OFFSET ( RTASBASE , rtas_t , base ) ;
OFFSET ( RTASENTRY , rtas_t , entry ) ;
2005-09-28 18:35:31 +04:00
2005-09-26 10:04:21 +04:00
/* Interrupt register frame */
2008-04-24 00:33:49 +04:00
DEFINE ( INT_FRAME_SIZE , STACK_INT_FRAME_SIZE ) ;
powerpc/64: Fix stack trace not displaying final frame
In commit bf13718bc57a ("powerpc: show registers when unwinding
interrupt frames") we changed our stack dumping logic to show the full
registers whenever we find an interrupt frame on the stack.
However we didn't notice that on 64-bit this doesn't show the final
frame, ie. the interrupt that brought us in from userspace, whereas on
32-bit it does.
That is due to confusion about the size of that last frame. The code
in show_stack() calls validate_sp(), passing it STACK_INT_FRAME_SIZE
to check the sp is at least that far below the top of the stack.
However on 64-bit that size is too large for the final frame, because
it includes the red zone, but we don't allocate a red zone for the
first frame.
So add a new define that encodes the correct size for 32-bit and
64-bit, and use it in show_stack().
This results in the full trace being shown on 64-bit, eg:
sysrq: Trigger a crash
Kernel panic - not syncing: sysrq triggered crash
CPU: 0 PID: 83 Comm: sh Not tainted 5.11.0-rc2-gcc-8.2.0-00188-g571abcb96b10-dirty #649
Call Trace:
[c00000000a1c3ac0] [c000000000897b70] dump_stack+0xc4/0x114 (unreliable)
[c00000000a1c3b00] [c00000000014334c] panic+0x178/0x41c
[c00000000a1c3ba0] [c00000000094e600] sysrq_handle_crash+0x40/0x50
[c00000000a1c3c00] [c00000000094ef98] __handle_sysrq+0xd8/0x210
[c00000000a1c3ca0] [c00000000094f820] write_sysrq_trigger+0x100/0x188
[c00000000a1c3ce0] [c0000000005559dc] proc_reg_write+0x10c/0x1b0
[c00000000a1c3d10] [c000000000479950] vfs_write+0xf0/0x360
[c00000000a1c3d60] [c000000000479d9c] ksys_write+0x7c/0x140
[c00000000a1c3db0] [c00000000002bf5c] system_call_exception+0x19c/0x2c0
[c00000000a1c3e10] [c00000000000d35c] system_call_common+0xec/0x278
--- interrupt: c00 at 0x7fff9fbab428
NIP: 00007fff9fbab428 LR: 000000001000b724 CTR: 0000000000000000
REGS: c00000000a1c3e80 TRAP: 0c00 Not tainted (5.11.0-rc2-gcc-8.2.0-00188-g571abcb96b10-dirty)
MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 22002884 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007fffc3cb8960 00007fff9fc59900 0000000000000001
GPR04: 000000002a4b32d0 0000000000000002 0000000000000063 0000000000000063
GPR08: 000000002a4b32d0 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9fcca9a0 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 00000000100b8fd0
GPR20: 000000002a4b3485 00000000100b8f90 0000000000000000 0000000000000000
GPR24: 000000002a4b0440 00000000100e77b8 0000000000000020 000000002a4b32d0
GPR28: 0000000000000001 0000000000000002 000000002a4b32d0 0000000000000001
NIP [00007fff9fbab428] 0x7fff9fbab428
LR [000000001000b724] 0x1000b724
--- interrupt: c00
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210209141627.2898485-1-mpe@ellerman.id.au
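The size mismatch described in this commit message can be illustrated with a minimal sketch. It assumes a simplified bounds check in the spirit of validate_sp(); the names `sp_in_bounds` and `final_frame_sp`, the stack size, and the byte counts for the register frame and the red zone are all hypothetical values chosen for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes, for illustration only (not the kernel's values). */
#define SKETCH_STACK_SIZE 0x4000u /* assumed kernel stack size */
#define SKETCH_REGS_FRAME 0x1c0u  /* assumed frame header + saved registers */
#define SKETCH_RED_ZONE   0x120u  /* assumed red zone size */

/*
 * Return true if a frame of nbytes starting at sp lies entirely inside
 * the stack region [base, base + SKETCH_STACK_SIZE).
 */
static bool sp_in_bounds(uintptr_t sp, uintptr_t base, uintptr_t nbytes)
{
	return sp >= base && sp <= base + SKETCH_STACK_SIZE - nbytes;
}

/* The final frame sits at the very top of the stack, with no red zone. */
static uintptr_t final_frame_sp(uintptr_t base)
{
	return base + SKETCH_STACK_SIZE - SKETCH_REGS_FRAME;
}
```

With these assumed numbers, checking the final frame against the frame size plus the red zone rejects it, while checking against the frame size alone accepts it, which is the behaviour the commit message describes.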
2021-02-09 16:59:20 +03:00
DEFINE ( SWITCH_FRAME_SIZE , STACK_FRAME_WITH_PT_REGS ) ;
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( GPR0 , gpr [ 0 ] ) ;
STACK_PT_REGS_OFFSET ( GPR1 , gpr [ 1 ] ) ;
STACK_PT_REGS_OFFSET ( GPR2 , gpr [ 2 ] ) ;
STACK_PT_REGS_OFFSET ( GPR3 , gpr [ 3 ] ) ;
STACK_PT_REGS_OFFSET ( GPR4 , gpr [ 4 ] ) ;
STACK_PT_REGS_OFFSET ( GPR5 , gpr [ 5 ] ) ;
STACK_PT_REGS_OFFSET ( GPR6 , gpr [ 6 ] ) ;
STACK_PT_REGS_OFFSET ( GPR7 , gpr [ 7 ] ) ;
STACK_PT_REGS_OFFSET ( GPR8 , gpr [ 8 ] ) ;
STACK_PT_REGS_OFFSET ( GPR9 , gpr [ 9 ] ) ;
STACK_PT_REGS_OFFSET ( GPR10 , gpr [ 10 ] ) ;
STACK_PT_REGS_OFFSET ( GPR11 , gpr [ 11 ] ) ;
STACK_PT_REGS_OFFSET ( GPR12 , gpr [ 12 ] ) ;
STACK_PT_REGS_OFFSET ( GPR13 , gpr [ 13 ] ) ;
2005-09-26 10:04:21 +04:00
/*
* Note: these symbols include _ because they overlap with special
* register names
*/
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( _NIP , nip ) ;
STACK_PT_REGS_OFFSET ( _MSR , msr ) ;
STACK_PT_REGS_OFFSET ( _CTR , ctr ) ;
STACK_PT_REGS_OFFSET ( _LINK , link ) ;
STACK_PT_REGS_OFFSET ( _CCR , ccr ) ;
STACK_PT_REGS_OFFSET ( _XER , xer ) ;
STACK_PT_REGS_OFFSET ( _DAR , dar ) ;
STACK_PT_REGS_OFFSET ( _DSISR , dsisr ) ;
STACK_PT_REGS_OFFSET ( ORIG_GPR3 , orig_gpr3 ) ;
STACK_PT_REGS_OFFSET ( RESULT , result ) ;
STACK_PT_REGS_OFFSET ( _TRAP , trap ) ;
2005-09-28 18:35:31 +04:00
# ifndef CONFIG_PPC64
/*
* The PowerPC 400-class & Book-E processors have neither the DAR
* nor the DSISR SPRs. Hence, we overload them to hold the similar
* DEAR and ESR SPRs for such processors. For critical interrupts
* we use them to hold SRR0 and SRR1.
2005-09-26 10:04:21 +04:00
*/
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( _DEAR , dar ) ;
STACK_PT_REGS_OFFSET ( _ESR , dsisr ) ;
2005-09-28 18:35:31 +04:00
# else /* CONFIG_PPC64 */
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( SOFTE , softe ) ;
2018-10-12 16:15:16 +03:00
STACK_PT_REGS_OFFSET ( _PPR , ppr ) ;
2005-09-28 18:35:31 +04:00
# endif /* CONFIG_PPC64 */
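The overloading of the DAR and DSISR slots in the 32-bit branch above amounts to exporting two names for the same structure offset. A minimal sketch follows, assuming a reduced, hypothetical `struct regs_sketch` and `SK_`-prefixed constants that stand in for the real pt_regs layout and the generated symbols.

```c
#include <stddef.h>

/* Hypothetical, reduced register frame: only the two overloaded slots. */
struct regs_sketch {
	unsigned long dar;
	unsigned long dsisr;
};

enum {
	SK_DAR   = offsetof(struct regs_sketch, dar),
	SK_DSISR = offsetof(struct regs_sketch, dsisr),
	/* Aliases: same storage, a different name for the assembly code. */
	SK_DEAR  = offsetof(struct regs_sketch, dar),
	SK_ESR   = offsetof(struct regs_sketch, dsisr),
};
```

Assembly written for a Book-E style core can then refer to the alias names while landing on the same bytes as the classic names.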
2020-11-27 07:44:05 +03:00
# ifdef CONFIG_PPC_PKEY
STACK_PT_REGS_OFFSET ( STACK_REGS_AMR , amr ) ;
2020-11-27 07:44:12 +03:00
STACK_PT_REGS_OFFSET ( STACK_REGS_IAMR , iamr ) ;
2020-11-27 07:44:05 +03:00
# endif
2020-11-27 07:44:12 +03:00
2009-07-28 05:59:34 +04:00
# if defined(CONFIG_PPC32)
2008-04-30 14:23:21 +04:00
# if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
DEFINE ( EXC_LVL_SIZE , STACK_EXC_LVL_FRAME_SIZE ) ;
DEFINE ( MAS0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas0 ) ) ;
/* we overload MMUCR for 44x on MAS0 since they are mutually exclusive */
DEFINE ( MMUCR , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas0 ) ) ;
DEFINE ( MAS1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas1 ) ) ;
DEFINE ( MAS2 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas2 ) ) ;
DEFINE ( MAS3 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas3 ) ) ;
DEFINE ( MAS6 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas6 ) ) ;
DEFINE ( MAS7 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas7 ) ) ;
DEFINE ( _SRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , srr0 ) ) ;
DEFINE ( _SRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , srr1 ) ) ;
DEFINE ( _CSRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , csrr0 ) ) ;
DEFINE ( _CSRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , csrr1 ) ) ;
DEFINE ( _DSRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , dsrr0 ) ) ;
DEFINE ( _DSRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , dsrr1 ) ) ;
# endif
2009-07-28 05:59:34 +04:00
# endif
2005-09-28 18:35:31 +04:00
2005-09-26 10:04:21 +04:00
/* About the CPU features table */
2017-02-15 13:41:20 +03:00
OFFSET ( CPU_SPEC_FEATURES , cpu_spec , cpu_features ) ;
OFFSET ( CPU_SPEC_SETUP , cpu_spec , cpu_setup ) ;
OFFSET ( CPU_SPEC_RESTORE , cpu_spec , cpu_restore ) ;
2005-09-26 10:04:21 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( pbe_address , pbe , address ) ;
OFFSET ( pbe_orig_address , pbe , orig_address ) ;
OFFSET ( pbe_next , pbe , next ) ;
2005-09-26 10:04:21 +04:00
2007-05-03 16:31:38 +04:00
# ifndef CONFIG_PPC64
2005-10-11 16:08:12 +04:00
DEFINE ( TASK_SIZE , TASK_SIZE ) ;
2005-09-28 18:35:31 +04:00
DEFINE ( NUM_USER_SEGMENTS , TASK_SIZE >> 28 ) ;
2005-11-11 13:15:21 +03:00
# endif /* ! CONFIG_PPC64 */
2005-09-26 10:04:21 +04:00
2005-11-11 13:15:21 +03:00
/* datapage offsets for use by vdso */
2020-11-26 16:10:05 +03:00
OFFSET ( VDSO_DATA_OFFSET , vdso_arch_data , data ) ;
OFFSET ( CFG_TB_TICKS_PER_SEC , vdso_arch_data , tb_ticks_per_sec ) ;
2019-12-02 10:57:31 +03:00
# ifdef CONFIG_PPC64
2020-11-26 16:10:05 +03:00
OFFSET ( CFG_ICACHE_BLOCKSZ , vdso_arch_data , icache_block_size ) ;
OFFSET ( CFG_DCACHE_BLOCKSZ , vdso_arch_data , dcache_block_size ) ;
OFFSET ( CFG_ICACHE_LOGBLOCKSZ , vdso_arch_data , icache_log_block_size ) ;
OFFSET ( CFG_DCACHE_LOGBLOCKSZ , vdso_arch_data , dcache_log_block_size ) ;
2020-09-27 12:16:20 +03:00
OFFSET ( CFG_SYSCALL_MAP64 , vdso_arch_data , syscall_map ) ;
OFFSET ( CFG_SYSCALL_MAP32 , vdso_arch_data , compat_syscall_map ) ;
# else
OFFSET ( CFG_SYSCALL_MAP32 , vdso_arch_data , syscall_map ) ;
2020-11-26 16:10:05 +03:00
# endif
2005-11-11 13:15:21 +03:00
2007-01-01 21:45:34 +03:00
# ifdef CONFIG_BUG
DEFINE ( BUG_ENTRY_SIZE , sizeof ( struct bug_entry ) ) ;
# endif
2007-08-20 08:58:36 +04:00
2017-04-12 07:56:36 +03:00
# ifdef CONFIG_PPC_BOOK3S_64
DEFINE ( PGD_TABLE_SIZE , ( sizeof ( pgd_t ) << max ( RADIX_PGD_INDEX_SIZE , H_PGD_INDEX_SIZE ) ) ) ;
2016-04-29 16:25:49 +03:00
# else
2007-09-18 11:22:59 +04:00
DEFINE ( PGD_TABLE_SIZE , PGD_TABLE_SIZE ) ;
2016-04-29 16:25:49 +03:00
# endif
2008-09-24 20:01:24 +04:00
DEFINE ( PTE_SIZE , sizeof ( pte_t ) ) ;
2007-12-06 22:11:04 +03:00
2008-04-17 08:28:09 +04:00
# ifdef CONFIG_KVM
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_HOST_STACK , kvm_vcpu , arch . host_stack ) ;
OFFSET ( VCPU_HOST_PID , kvm_vcpu , arch . host_pid ) ;
OFFSET ( VCPU_GUEST_PID , kvm_vcpu , arch . pid ) ;
2018-05-07 09:20:07 +03:00
OFFSET ( VCPU_GPRS , kvm_vcpu , arch . regs . gpr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_VRSAVE , kvm_vcpu , arch . vrsave ) ;
OFFSET ( VCPU_FPRS , kvm_vcpu , arch . fp . fpr ) ;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:21:34 +04:00
# ifdef CONFIG_ALTIVEC
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_VRS , kvm_vcpu , arch . vr . vr ) ;
2011-06-29 04:21:34 +04:00
# endif
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_XER , kvm_vcpu , arch . regs . xer ) ;
OFFSET ( VCPU_CTR , kvm_vcpu , arch . regs . ctr ) ;
OFFSET ( VCPU_LR , kvm_vcpu , arch . regs . link ) ;
2014-04-22 14:26:58 +04:00
# ifdef CONFIG_PPC_BOOK3S
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TAR , kvm_vcpu , arch . tar ) ;
2014-04-22 14:26:58 +04:00
# endif
2018-10-08 08:30:58 +03:00
OFFSET ( VCPU_CR , kvm_vcpu , arch . regs . ccr ) ;
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_PC , kvm_vcpu , arch . regs . nip ) ;
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_MSR , kvm_vcpu , arch . shregs . msr ) ;
OFFSET ( VCPU_SRR0 , kvm_vcpu , arch . shregs . srr0 ) ;
OFFSET ( VCPU_SRR1 , kvm_vcpu , arch . shregs . srr1 ) ;
OFFSET ( VCPU_SPRG0 , kvm_vcpu , arch . shregs . sprg0 ) ;
OFFSET ( VCPU_SPRG1 , kvm_vcpu , arch . shregs . sprg1 ) ;
OFFSET ( VCPU_SPRG2 , kvm_vcpu , arch . shregs . sprg2 ) ;
OFFSET ( VCPU_SPRG3 , kvm_vcpu , arch . shregs . sprg3 ) ;
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spent in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
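The per-activity accumulation described in this commit message can be sketched in plain C. The field names follow the seqcount, tb_total, tb_min and tb_max members referenced by the offsets in this file; everything else is an assumption made for illustration: the struct name `tb_accumulator_sketch`, the helper names, and the idea that the sequence counter is bumped once before and once after each update so that half its value gives the sample count. The real code runs in real mode and would also need memory barriers, which are omitted here.

```c
#include <stdint.h>

/* Field names mirror the exported offsets; the layout is an assumption. */
struct tb_accumulator_sketch {
	uint64_t seqcount; /* bumped twice per update; odd while updating */
	uint64_t tb_total;
	uint64_t tb_min;
	uint64_t tb_max;
};

/* Fold one elapsed-time sample (in timebase ticks) into the accumulator. */
static void tb_accumulate(struct tb_accumulator_sketch *acc, uint64_t delta)
{
	int first = (acc->seqcount == 0);

	acc->seqcount++; /* now odd: update in progress */
	acc->tb_total += delta;
	if (first || delta < acc->tb_min)
		acc->tb_min = delta;
	if (delta > acc->tb_max)
		acc->tb_max = delta;
	acc->seqcount++; /* even again: update complete */
}

/* Number of completed samples, under the two-bumps-per-update assumption. */
static uint64_t tb_sample_count(const struct tb_accumulator_sketch *acc)
{
	return acc->seqcount / 2;
}
```

A reader that sees an odd sequence count, or a count that changed while it was copying the fields, would retry; that is the usual reason for pairing the statistics with a sequence counter.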
2015-03-28 06:21:02 +03:00
# endif
# ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TB_RMENTRY , kvm_vcpu , arch . rm_entry ) ;
OFFSET ( VCPU_TB_RMINTR , kvm_vcpu , arch . rm_intr ) ;
OFFSET ( VCPU_TB_RMEXIT , kvm_vcpu , arch . rm_exit ) ;
OFFSET ( VCPU_TB_GUEST , kvm_vcpu , arch . guest_time ) ;
OFFSET ( VCPU_TB_CEDE , kvm_vcpu , arch . cede_time ) ;
OFFSET ( VCPU_CUR_ACTIVITY , kvm_vcpu , arch . cur_activity ) ;
OFFSET ( VCPU_ACTIVITY_START , kvm_vcpu , arch . cur_tb_start ) ;
OFFSET ( TAS_SEQCOUNT , kvmhv_tb_accumulator , seqcount ) ;
OFFSET ( TAS_TOTAL , kvmhv_tb_accumulator , tb_total ) ;
OFFSET ( TAS_MIN , kvmhv_tb_accumulator , tb_min ) ;
OFFSET ( TAS_MAX , kvmhv_tb_accumulator , tb_max ) ;
# endif
OFFSET ( VCPU_SHARED_SPRG3 , kvm_vcpu_arch_shared , sprg3 ) ;
OFFSET ( VCPU_SHARED_SPRG4 , kvm_vcpu_arch_shared , sprg4 ) ;
OFFSET ( VCPU_SHARED_SPRG5 , kvm_vcpu_arch_shared , sprg5 ) ;
OFFSET ( VCPU_SHARED_SPRG6 , kvm_vcpu_arch_shared , sprg6 ) ;
OFFSET ( VCPU_SHARED_SPRG7 , kvm_vcpu_arch_shared , sprg7 ) ;
OFFSET ( VCPU_SHADOW_PID , kvm_vcpu , arch . shadow_pid ) ;
OFFSET ( VCPU_SHADOW_PID1 , kvm_vcpu , arch . shadow_pid1 ) ;
OFFSET ( VCPU_SHARED , kvm_vcpu , arch . shared ) ;
OFFSET ( VCPU_SHARED_MSR , kvm_vcpu_arch_shared , msr ) ;
OFFSET ( VCPU_SHADOW_MSR , kvm_vcpu , arch . shadow_msr ) ;
2014-04-24 15:46:24 +04:00
# if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SHAREDBE , kvm_vcpu , arch . shared_big_endian ) ;
2014-04-24 15:46:24 +04:00
# endif
2008-04-17 08:28:09 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SHARED_MAS0 , kvm_vcpu_arch_shared , mas0 ) ;
OFFSET ( VCPU_SHARED_MAS1 , kvm_vcpu_arch_shared , mas1 ) ;
OFFSET ( VCPU_SHARED_MAS2 , kvm_vcpu_arch_shared , mas2 ) ;
OFFSET ( VCPU_SHARED_MAS7_3 , kvm_vcpu_arch_shared , mas7_3 ) ;
OFFSET ( VCPU_SHARED_MAS4 , kvm_vcpu_arch_shared , mas4 ) ;
OFFSET ( VCPU_SHARED_MAS6 , kvm_vcpu_arch_shared , mas6 ) ;
KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn
This allows additional registers to be accessed by the guest
in PR-mode KVM without trapping.
SPRG4-7 are readable from userspace. On booke, KVM will sync
these registers when it enters the guest, so that accesses from
guest userspace will work. The guest kernel, OTOH, must consistently
use either the real registers or the shared area between exits. This
also applies to the already-paravirted SPRG3.
On non-booke, it's not clear to what extent SPRG4-7 are supported
(they're not architected for book3s, but exist on at least some classic
chips). They are copied in the get/set regs ioctls, but I do not see any
non-booke emulation. I also do not see any syncing with real registers
(in PR-mode) including the user-readable SPRG3. This patch should not
make that situation any worse.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-09 04:23:30 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_KVM , kvm_vcpu , kvm ) ;
OFFSET ( KVM_LPID , kvm , arch . lpid ) ;
2011-12-20 19:34:43 +04:00
2010-04-16 02:11:42 +04:00
/* book3s */
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_TLB_SETS , kvm , arch . tlb_sets ) ;
OFFSET ( KVM_SDR1 , kvm , arch . sdr1 ) ;
OFFSET ( KVM_HOST_LPID , kvm , arch . host_lpid ) ;
OFFSET ( KVM_HOST_LPCR , kvm , arch . host_lpcr ) ;
OFFSET ( KVM_HOST_SDR1 , kvm , arch . host_sdr1 ) ;
OFFSET ( KVM_ENABLED_HCALLS , kvm , arch . enabled_hcalls ) ;
OFFSET ( KVM_VRMA_SLB_V , kvm , arch . vrma_slb_v ) ;
OFFSET ( KVM_RADIX , kvm , arch . radix ) ;
2019-08-22 06:48:38 +03:00
OFFSET ( KVM_SECURE_GUEST , kvm , arch . secure_guest ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_DSISR , kvm_vcpu , arch . shregs . dsisr ) ;
OFFSET ( VCPU_DAR , kvm_vcpu , arch . shregs . dar ) ;
OFFSET ( VCPU_VPA , kvm_vcpu , arch . vpa . pinned_addr ) ;
OFFSET ( VCPU_VPA_DIRTY , kvm_vcpu , arch . vpa . dirty ) ;
OFFSET ( VCPU_HEIR , kvm_vcpu , arch . emul_inst ) ;
2018-10-08 08:31:04 +03:00
OFFSET ( VCPU_NESTED , kvm_vcpu , arch . nested ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_CPU , kvm_vcpu , cpu ) ;
OFFSET ( VCPU_THREAD_CPU , kvm_vcpu , arch . thread_cpu ) ;
2011-06-29 04:21:34 +04:00
# endif
2010-04-16 02:11:42 +04:00
# ifdef CONFIG_PPC_BOOK3S
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_PURR , kvm_vcpu , arch . purr ) ;
OFFSET ( VCPU_SPURR , kvm_vcpu , arch . spurr ) ;
OFFSET ( VCPU_IC , kvm_vcpu , arch . ic ) ;
OFFSET ( VCPU_DSCR , kvm_vcpu , arch . dscr ) ;
OFFSET ( VCPU_AMR , kvm_vcpu , arch . amr ) ;
OFFSET ( VCPU_UAMOR , kvm_vcpu , arch . uamor ) ;
OFFSET ( VCPU_IAMR , kvm_vcpu , arch . iamr ) ;
OFFSET ( VCPU_CTRL , kvm_vcpu , arch . ctrl ) ;
OFFSET ( VCPU_DABR , kvm_vcpu , arch . dabr ) ;
OFFSET ( VCPU_DABRX , kvm_vcpu , arch . dabrx ) ;
2020-12-16 13:42:17 +03:00
OFFSET ( VCPU_DAWR0 , kvm_vcpu , arch . dawr0 ) ;
OFFSET ( VCPU_DAWRX0 , kvm_vcpu , arch . dawrx0 ) ;
2020-12-16 13:42:18 +03:00
OFFSET ( VCPU_DAWR1 , kvm_vcpu , arch . dawr1 ) ;
OFFSET ( VCPU_DAWRX1 , kvm_vcpu , arch . dawrx1 ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_CIABR , kvm_vcpu , arch . ciabr ) ;
OFFSET ( VCPU_HFLAGS , kvm_vcpu , arch . hflags ) ;
OFFSET ( VCPU_DEC_EXPIRES , kvm_vcpu , arch . dec_expires ) ;
OFFSET ( VCPU_PENDING_EXC , kvm_vcpu , arch . pending_exceptions ) ;
OFFSET ( VCPU_CEDED , kvm_vcpu , arch . ceded ) ;
OFFSET ( VCPU_PRODDED , kvm_vcpu , arch . prodded ) ;
2018-01-12 05:37:13 +03:00
OFFSET ( VCPU_IRQ_PENDING , kvm_vcpu , arch . irq_pending ) ;
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
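The flag handoff described above can be sketched in plain C. This is a simplified model, not the kernel's guest entry code: `struct vcore_model`, `struct vcpu_model`, `request_doorbell()` and `dpdes_for_guest_entry()` are hypothetical names, and loading the saved `dpdes` value when no request is pending is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's vcore and vcpu state. */
struct vcore_model {
	uint64_t dpdes;            /* saved guest DPDES value */
};

struct vcpu_model {
	bool doorbell_request;     /* set by other VCPUs, consumed on guest entry */
	struct vcore_model *vcore;
};

/* A sender never writes vcore->dpdes directly; it only raises the flag. */
static void request_doorbell(struct vcpu_model *target)
{
	target->doorbell_request = true;
}

/*
 * Guest entry: consume the flag and return the value that would be
 * loaded into the hardware DPDES register. Falling back to the saved
 * value when no request is pending is an assumption of this sketch.
 */
static uint64_t dpdes_for_guest_entry(struct vcpu_model *vcpu)
{
	uint64_t dpdes = vcpu->vcore->dpdes;

	if (vcpu->doorbell_request) {
		vcpu->doorbell_request = false;  /* consumed and set to 0 */
		dpdes = 1;                       /* raise the doorbell */
	}
	return dpdes;
}
```

Keeping the request in a separate flag means the sender never races with the guest exit code, which is the only writer of the saved `dpdes` field.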
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-16 09:41:20 +03:00
OFFSET ( VCPU_DBELL_REQ , kvm_vcpu , arch . doorbell_request ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_MMCR , kvm_vcpu , arch . mmcr ) ;
2020-07-17 17:38:14 +03:00
OFFSET ( VCPU_MMCRA , kvm_vcpu , arch . mmcra ) ;
OFFSET ( VCPU_MMCRS , kvm_vcpu , arch . mmcrs ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_PMC , kvm_vcpu , arch . pmc ) ;
OFFSET ( VCPU_SIAR , kvm_vcpu , arch . siar ) ;
OFFSET ( VCPU_SDAR , kvm_vcpu , arch . sdar ) ;
OFFSET ( VCPU_SIER , kvm_vcpu , arch . sier ) ;
OFFSET ( VCPU_SLB , kvm_vcpu , arch . slb ) ;
OFFSET ( VCPU_SLB_MAX , kvm_vcpu , arch . slb_max ) ;
OFFSET ( VCPU_SLB_NR , kvm_vcpu , arch . slb_nr ) ;
OFFSET ( VCPU_FAULT_DSISR , kvm_vcpu , arch . fault_dsisr ) ;
OFFSET ( VCPU_FAULT_DAR , kvm_vcpu , arch . fault_dar ) ;
OFFSET ( VCPU_FAULT_GPA , kvm_vcpu , arch . fault_gpa ) ;
OFFSET ( VCPU_INTR_MSR , kvm_vcpu , arch . intr_msr ) ;
OFFSET ( VCPU_LAST_INST , kvm_vcpu , arch . last_inst ) ;
OFFSET ( VCPU_TRAP , kvm_vcpu , arch . trap ) ;
OFFSET ( VCPU_CFAR , kvm_vcpu , arch . cfar ) ;
OFFSET ( VCPU_PPR , kvm_vcpu , arch . ppr ) ;
OFFSET ( VCPU_FSCR , kvm_vcpu , arch . fscr ) ;
OFFSET ( VCPU_PSPB , kvm_vcpu , arch . pspb ) ;
OFFSET ( VCPU_EBBHR , kvm_vcpu , arch . ebbhr ) ;
OFFSET ( VCPU_EBBRR , kvm_vcpu , arch . ebbrr ) ;
OFFSET ( VCPU_BESCR , kvm_vcpu , arch . bescr ) ;
OFFSET ( VCPU_CSIGR , kvm_vcpu , arch . csigr ) ;
OFFSET ( VCPU_TACR , kvm_vcpu , arch . tacr ) ;
OFFSET ( VCPU_TCSCR , kvm_vcpu , arch . tcscr ) ;
OFFSET ( VCPU_ACOP , kvm_vcpu , arch . acop ) ;
OFFSET ( VCPU_WORT , kvm_vcpu , arch . wort ) ;
OFFSET ( VCPU_TID , kvm_vcpu , arch . tid ) ;
OFFSET ( VCPU_PSSCR , kvm_vcpu , arch . psscr ) ;
2017-02-15 06:30:17 +03:00
OFFSET ( VCPU_HFSCR , kvm_vcpu , arch . hfscr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCORE_ENTRY_EXIT , kvmppc_vcore , entry_exit_map ) ;
OFFSET ( VCORE_IN_GUEST , kvmppc_vcore , in_guest ) ;
OFFSET ( VCORE_NAPPING_THREADS , kvmppc_vcore , napping_threads ) ;
OFFSET ( VCORE_KVM , kvmppc_vcore , kvm ) ;
OFFSET ( VCORE_TB_OFFSET , kvmppc_vcore , tb_offset ) ;
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit. Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore. If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.
To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit. Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.
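A minimal sketch of the snapshot idea, assuming a simplified model rather than the real entry/exit assembly: `struct vcore_tb_model`, `tb_on_guest_entry()` and `tb_on_guest_exit()` are hypothetical names, and the timebase is passed in as a plain integer instead of being read from the hardware register.

```c
#include <stdint.h>

/* Hypothetical model of the two vcore fields discussed above. */
struct vcore_tb_model {
	uint64_t tb_offset;          /* may be changed by userspace at any time */
	uint64_t tb_offset_applied;  /* offset currently added to the timebase, 0 if none */
};

/* Guest entry: snapshot the offset, then apply the snapshot. */
static uint64_t tb_on_guest_entry(struct vcore_tb_model *vc, uint64_t host_tb)
{
	vc->tb_offset_applied = vc->tb_offset;
	return host_tb + vc->tb_offset_applied;
}

/* Guest exit: subtract exactly what was added, not the current tb_offset. */
static uint64_t tb_on_guest_exit(struct vcore_tb_model *vc, uint64_t guest_tb)
{
	uint64_t host_tb = guest_tb - vc->tb_offset_applied;

	vc->tb_offset_applied = 0;   /* zero means no offset is applied */
	return host_tb;
}
```

If `tb_offset` changes between the two calls, the exit path still removes the value that entry added, so the host timebase stays consistent.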
In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads. Although no specific incorrect behaviour has
been observed, this is a race which should be fixed. To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset. That way we are sure that the timebase contains the guest
timebase value in both cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-04-20 15:51:11 +03:00
OFFSET ( VCORE_TB_OFFSET_APPL , kvmppc_vcore , tb_offset_applied ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCORE_LPCR , kvmppc_vcore , lpcr ) ;
OFFSET ( VCORE_PCR , kvmppc_vcore , pcr ) ;
OFFSET ( VCORE_DPDES , kvmppc_vcore , dpdes ) ;
OFFSET ( VCORE_VTB , kvmppc_vcore , vtb ) ;
OFFSET ( VCPU_SLB_E , kvmppc_slb , orige ) ;
OFFSET ( VCPU_SLB_V , kvmppc_slb , origv ) ;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:21:34 +04:00
DEFINE ( VCPU_SLB_SIZE , sizeof ( struct kvmppc_slb ) ) ;
2014-01-08 14:25:32 +04:00
# ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TFHAR , kvm_vcpu , arch . tfhar ) ;
OFFSET ( VCPU_TFIAR , kvm_vcpu , arch . tfiar ) ;
OFFSET ( VCPU_TEXASR , kvm_vcpu , arch . texasr ) ;
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
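The guest entry decision described above can be summarized as a small selection function. This is only an illustrative sketch: the enums and `p9_dd22_entry_action()` are hypothetical names, and the real code operates on MSR and PSSCR bits rather than on an abstract state value.

```c
/* Hypothetical abstraction of the guest's transactional memory state. */
enum tm_state {
	TM_NONTRANSACTIONAL,
	TM_SUSPENDED,
	TM_TRANSACTIONAL
};

/* What the entry path does in place of the usual trechkpt. */
enum entry_action {
	ENTRY_NORMAL,        /* no checkpoint needed */
	ENTRY_FAKE_SUSPEND,  /* looks suspended to the guest, no checkpoint stored */
	ENTRY_ROLLBACK       /* return to the pre-transactional state */
};

static enum entry_action p9_dd22_entry_action(enum tm_state guest_state)
{
	switch (guest_state) {
	case TM_SUSPENDED:
		return ENTRY_FAKE_SUSPEND;
	case TM_TRANSACTIONAL:
		return ENTRY_ROLLBACK;
	default:
		return ENTRY_NORMAL;
	}
}
```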
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 13:32:01 +03:00
OFFSET ( VCPU_ORIG_TEXASR , kvm_vcpu , arch . orig_texasr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_GPR_TM , kvm_vcpu , arch . gpr_tm ) ;
OFFSET ( VCPU_FPRS_TM , kvm_vcpu , arch . fp_tm . fpr ) ;
OFFSET ( VCPU_VRS_TM , kvm_vcpu , arch . vr_tm . vr ) ;
OFFSET ( VCPU_VRSAVE_TM , kvm_vcpu , arch . vrsave_tm ) ;
OFFSET ( VCPU_CR_TM , kvm_vcpu , arch . cr_tm ) ;
OFFSET ( VCPU_XER_TM , kvm_vcpu , arch . xer_tm ) ;
OFFSET ( VCPU_LR_TM , kvm_vcpu , arch . lr_tm ) ;
OFFSET ( VCPU_CTR_TM , kvm_vcpu , arch . ctr_tm ) ;
OFFSET ( VCPU_AMR_TM , kvm_vcpu , arch . amr_tm ) ;
OFFSET ( VCPU_PPR_TM , kvm_vcpu , arch . ppr_tm ) ;
OFFSET ( VCPU_DSCR_TM , kvm_vcpu , arch . dscr_tm ) ;
OFFSET ( VCPU_TAR_TM , kvm_vcpu , arch . tar_tm ) ;
2014-01-08 14:25:32 +04:00
# endif
2011-06-29 04:20:58 +04:00
# ifdef CONFIG_PPC_BOOK3S_64
2013-10-07 20:47:51 +04:00
# ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_SVCPU , paca_struct , shadow_vcpu ) ;
2011-06-29 04:20:58 +04:00
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, shadow_vcpu.f))
2011-06-29 04:21:34 +04:00
# else
# define SVCPU_FIELD(x, f)
# endif
2011-06-29 04:20:58 +04:00
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, kvm_hstate.f))
# else /* 32-bit */
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, f))
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, hstate.f))
# endif
SVCPU_FIELD ( SVCPU_CR , cr ) ;
SVCPU_FIELD ( SVCPU_XER , xer ) ;
SVCPU_FIELD ( SVCPU_CTR , ctr ) ;
SVCPU_FIELD ( SVCPU_LR , lr ) ;
SVCPU_FIELD ( SVCPU_PC , pc ) ;
SVCPU_FIELD ( SVCPU_R0 , gpr [ 0 ] ) ;
SVCPU_FIELD ( SVCPU_R1 , gpr [ 1 ] ) ;
SVCPU_FIELD ( SVCPU_R2 , gpr [ 2 ] ) ;
SVCPU_FIELD ( SVCPU_R3 , gpr [ 3 ] ) ;
SVCPU_FIELD ( SVCPU_R4 , gpr [ 4 ] ) ;
SVCPU_FIELD ( SVCPU_R5 , gpr [ 5 ] ) ;
SVCPU_FIELD ( SVCPU_R6 , gpr [ 6 ] ) ;
SVCPU_FIELD ( SVCPU_R7 , gpr [ 7 ] ) ;
SVCPU_FIELD ( SVCPU_R8 , gpr [ 8 ] ) ;
SVCPU_FIELD ( SVCPU_R9 , gpr [ 9 ] ) ;
SVCPU_FIELD ( SVCPU_R10 , gpr [ 10 ] ) ;
SVCPU_FIELD ( SVCPU_R11 , gpr [ 11 ] ) ;
SVCPU_FIELD ( SVCPU_R12 , gpr [ 12 ] ) ;
SVCPU_FIELD ( SVCPU_R13 , gpr [ 13 ] ) ;
SVCPU_FIELD ( SVCPU_FAULT_DSISR , fault_dsisr ) ;
SVCPU_FIELD ( SVCPU_FAULT_DAR , fault_dar ) ;
SVCPU_FIELD ( SVCPU_LAST_INST , last_inst ) ;
SVCPU_FIELD ( SVCPU_SHADOW_SRR1 , shadow_srr1 ) ;
2010-04-16 02:11:44 +04:00
# ifdef CONFIG_PPC_BOOK3S_32
2011-06-29 04:20:58 +04:00
SVCPU_FIELD ( SVCPU_SR , sr ) ;
2010-04-16 02:11:44 +04:00
# endif
2011-06-29 04:20:58 +04:00
# ifdef CONFIG_PPC64
SVCPU_FIELD ( SVCPU_SLB , slb ) ;
SVCPU_FIELD ( SVCPU_SLB_MAX , slb_max ) ;
2014-04-29 18:48:44 +04:00
SVCPU_FIELD ( SVCPU_SHADOW_FSCR , shadow_fscr ) ;
2011-06-29 04:20:58 +04:00
# endif
HSTATE_FIELD ( HSTATE_HOST_R1 , host_r1 ) ;
HSTATE_FIELD ( HSTATE_HOST_R2 , host_r2 ) ;
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_HOST_MSR , host_msr ) ;
2011-06-29 04:20:58 +04:00
HSTATE_FIELD ( HSTATE_VMHANDLER , vmhandler ) ;
HSTATE_FIELD ( HSTATE_SCRATCH0 , scratch0 ) ;
HSTATE_FIELD ( HSTATE_SCRATCH1 , scratch1 ) ;
2013-11-11 17:59:47 +04:00
HSTATE_FIELD ( HSTATE_SCRATCH2 , scratch2 ) ;
2011-06-29 04:20:58 +04:00
HSTATE_FIELD ( HSTATE_IN_GUEST , in_guest ) ;
2011-07-23 11:41:44 +04:00
HSTATE_FIELD ( HSTATE_RESTORE_HID5 , restore_hid5 ) ;
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
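The H_CEDE decision can be sketched with a bitmask of napping threads. This is a hedged model in C, not the real-mode assembly: `handle_cede()`, the `active_mask` argument and the one-bit-per-thread layout are assumptions made for illustration.

```c
/* Hypothetical outcome of an H_CEDE hcall handled in real mode. */
enum cede_action {
	CEDE_NAP,             /* mark this thread as napping and execute nap */
	CEDE_EXIT_TO_KERNEL   /* every other vcpu in the core is idle too */
};

/*
 * napping:     bitmask of threads in the core that are already napping
 * active_mask: bitmask of threads in the core that are running a vcpu
 * this_bit:    the bit for the thread doing the H_CEDE
 */
static enum cede_action handle_cede(unsigned int *napping,
				    unsigned int active_mask,
				    unsigned int this_bit)
{
	unsigned int others = active_mask & ~this_bit;

	/* Exit to the kernel only if all the other threads are idle. */
	if ((*napping & others) == others)
		return CEDE_EXIT_TO_KERNEL;

	*napping |= this_bit;
	return CEDE_NAP;
}
```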
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 11:42:46 +04:00
HSTATE_FIELD ( HSTATE_NAPPING , napping ) ;
2011-06-29 04:20:58 +04:00
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2012-03-06 01:42:25 +04:00
HSTATE_FIELD ( HSTATE_HWTHREAD_REQ , hwthread_req ) ;
HSTATE_FIELD ( HSTATE_HWTHREAD_STATE , hwthread_state ) ;
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_KVM_VCPU , kvm_vcpu ) ;
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
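As a worked example of the numbering rule above, the following sketch computes the vcore for a vcpu and the vcpu numbers a single-threaded guest would use. The function names are hypothetical; only the division rule comes from the text.

```c
/* vcore number = vcpu number divided by the host's threads per core. */
static int vcore_id_for_vcpu(int vcpu_id, int threads_per_core)
{
	return vcpu_id / threads_per_core;
}

/*
 * For a single-threaded guest, the n-th vcpu gets an id that is a
 * multiple of threads_per_core, so each vcpu lands in its own vcore.
 */
static int single_threaded_vcpu_id(int n, int threads_per_core)
{
	return n * threads_per_core;
}
```

With 4 threads per core, vcpus 0 to 3 share vcore 0 and vcpu 5 belongs to vcore 1, while a single-threaded guest would number its vcpus 0, 4, 8 and so on.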
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
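The scheduling policy above reduces to a simple check over the vcpu states of a vcore. The enum and `vcore_may_run()` below are hypothetical names for a sketch of that check; it tests only the stated condition and ignores everything else the real run loop considers.

```c
#include <stdbool.h>
#include <stddef.h>

/* The three vcpu states distinguished in the text. */
enum vcpu_run_state {
	STATE_RUNNABLE,      /* ready to execute the guest */
	STATE_BLOCKED,       /* idle */
	STATE_BUSY_IN_HOST   /* executing elsewhere in the kernel or in qemu */
};

/* A vcore may run only when no vcpu in it is busy in the host. */
static bool vcore_may_run(const enum vcpu_run_state *states, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (states[i] == STATE_BUSY_IN_HOST)
			return false;
	}
	return true;
}
```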
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:23:08 +04:00
HSTATE_FIELD ( HSTATE_KVM_VCORE , kvm_vcore ) ;
2017-04-05 10:54:56 +03:00
HSTATE_FIELD ( HSTATE_XIVE_TIMA_PHYS , xive_tima_phys ) ;
HSTATE_FIELD ( HSTATE_XIVE_TIMA_VIRT , xive_tima_virt ) ;
2013-04-18 00:30:50 +04:00
HSTATE_FIELD ( HSTATE_HOST_IPI , host_ipi ) ;
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core. Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers. Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).
POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core. Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.
This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace. However, we do need thread 0
to be responsible for switching the MMU -- a previous version of
this patch that had other threads switching the MMU was found to
be responsible for occasional memory corruption and machine check
interrupts in the guest on POWER7 machines.
To accommodate this, we no longer pass the vcpu pointer to
__kvmppc_vcore_entry, but instead let the assembly code load it from
the PACA. Since the assembly code will need to know the kvm pointer
and the thread ID for threads which don't have a vcpu, we move the
thread ID into the PACA and we add a kvm pointer to the virtual core
structure.
In the case where thread 0 has no vcpu to run, it still calls into
kvmppc_hv_entry in order to do the MMU switch, and then naps until
either its vcpu is ready to run in the guest, or some other thread
needs to exit the guest. In the latter case, thread 0 jumps to the
code that switches the MMU back to the host. This control flow means
that now we switch the MMU before loading any guest vcpu state.
Similarly, on guest exit we now save all the guest vcpu state before
switching the MMU back to the host. This has required substantial
code movement, making the diff rather large.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-08 14:25:20 +04:00
HSTATE_FIELD ( HSTATE_PTID , ptid ) ;
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a soft patch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 13:32:01 +03:00
HSTATE_FIELD ( HSTATE_FAKE_SUSPEND , fake_suspend ) ;
2014-07-10 13:34:31 +04:00
HSTATE_FIELD ( HSTATE_MMCR0 , host_mmcr [ 0 ] ) ;
HSTATE_FIELD ( HSTATE_MMCR1 , host_mmcr [ 1 ] ) ;
HSTATE_FIELD ( HSTATE_MMCRA , host_mmcr [ 2 ] ) ;
HSTATE_FIELD ( HSTATE_SIAR , host_mmcr [ 3 ] ) ;
HSTATE_FIELD ( HSTATE_SDAR , host_mmcr [ 4 ] ) ;
HSTATE_FIELD ( HSTATE_MMCR2 , host_mmcr [ 5 ] ) ;
HSTATE_FIELD ( HSTATE_SIER , host_mmcr [ 6 ] ) ;
2020-07-17 17:38:17 +03:00
HSTATE_FIELD ( HSTATE_MMCR3 , host_mmcr [ 7 ] ) ;
HSTATE_FIELD ( HSTATE_SIER2 , host_mmcr [ 8 ] ) ;
HSTATE_FIELD ( HSTATE_SIER3 , host_mmcr [ 9 ] ) ;
2014-07-10 13:34:31 +04:00
HSTATE_FIELD ( HSTATE_PMC1 , host_pmc [ 0 ] ) ;
HSTATE_FIELD ( HSTATE_PMC2 , host_pmc [ 1 ] ) ;
HSTATE_FIELD ( HSTATE_PMC3 , host_pmc [ 2 ] ) ;
HSTATE_FIELD ( HSTATE_PMC4 , host_pmc [ 3 ] ) ;
HSTATE_FIELD ( HSTATE_PMC5 , host_pmc [ 4 ] ) ;
HSTATE_FIELD ( HSTATE_PMC6 , host_pmc [ 5 ] ) ;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_PURR , host_purr ) ;
HSTATE_FIELD ( HSTATE_SPURR , host_spurr ) ;
HSTATE_FIELD ( HSTATE_DSCR , host_dscr ) ;
HSTATE_FIELD ( HSTATE_DABR , dabr ) ;
HSTATE_FIELD ( HSTATE_DECEXP , dec_expires ) ;
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
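The bitmap encoding described above can be read as "bit value N set means N-way mode is allowed". A minimal sketch, assuming a hypothetical helper name (only the parameter name dynamic_mt_modes comes from the text):

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if n_way micro-threading (2 or 4)
 * is permitted by the dynamic_mt_modes bitmap. Per the text:
 * 0 = none, 2 = 2-way only, 4 = 4-way only, 6 = both (the default).
 */
static bool demo_mt_mode_allowed(unsigned int dynamic_mt_modes,
				 unsigned int n_way)
{
	if (n_way != 2 && n_way != 4)
		return false;
	return (dynamic_mt_modes & n_way) != 0;
}
```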
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-07-02 13:38:16 +03:00
HSTATE_FIELD ( HSTATE_SPLIT_MODE , kvm_split_mode ) ;
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 11:42:46 +04:00
DEFINE ( IPI_PRIORITY , IPI_PRIORITY ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_SPLIT_RPR , kvm_split_mode , rpr ) ;
OFFSET ( KVM_SPLIT_PMMAR , kvm_split_mode , pmmar ) ;
OFFSET ( KVM_SPLIT_LDBAR , kvm_split_mode , ldbar ) ;
OFFSET ( KVM_SPLIT_DO_NAP , kvm_split_mode , do_nap ) ;
OFFSET ( KVM_SPLIT_NAPPED , kvm_split_mode , napped ) ;
2013-10-07 20:47:52 +04:00
# endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
2011-06-29 04:21:34 +04:00
2013-02-04 22:10:51 +04:00
# ifdef CONFIG_PPC_BOOK3S_64
HSTATE_FIELD ( HSTATE_CFAR , cfar ) ;
2013-09-20 08:52:39 +04:00
HSTATE_FIELD ( HSTATE_PPR , ppr ) ;
2014-04-29 18:48:44 +04:00
HSTATE_FIELD ( HSTATE_HOST_FSCR , host_fscr ) ;
2013-02-04 22:10:51 +04:00
# endif /* CONFIG_PPC_BOOK3S_64 */
2011-06-29 04:20:58 +04:00
# else /* CONFIG_PPC_BOOK3S */
2018-10-08 08:30:58 +03:00
OFFSET ( VCPU_CR , kvm_vcpu , arch . regs . ccr ) ;
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_XER , kvm_vcpu , arch . regs . xer ) ;
OFFSET ( VCPU_LR , kvm_vcpu , arch . regs . link ) ;
OFFSET ( VCPU_CTR , kvm_vcpu , arch . regs . ctr ) ;
OFFSET ( VCPU_PC , kvm_vcpu , arch . regs . nip ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SPRG9 , kvm_vcpu , arch . sprg9 ) ;
OFFSET ( VCPU_LAST_INST , kvm_vcpu , arch . last_inst ) ;
OFFSET ( VCPU_FAULT_DEAR , kvm_vcpu , arch . fault_dear ) ;
OFFSET ( VCPU_FAULT_ESR , kvm_vcpu , arch . fault_esr ) ;
OFFSET ( VCPU_CRIT_SAVE , kvm_vcpu , arch . crit_save ) ;
2010-04-16 02:11:42 +04:00
# endif /* CONFIG_PPC_BOOK3S */
2011-06-29 04:20:58 +04:00
# endif /* CONFIG_KVM */
2010-07-29 16:47:57 +04:00
# ifdef CONFIG_KVM_GUEST
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_MAGIC_SCRATCH1 , kvm_vcpu_arch_shared , scratch1 ) ;
OFFSET ( KVM_MAGIC_SCRATCH2 , kvm_vcpu_arch_shared , scratch2 ) ;
OFFSET ( KVM_MAGIC_SCRATCH3 , kvm_vcpu_arch_shared , scratch3 ) ;
OFFSET ( KVM_MAGIC_INT , kvm_vcpu_arch_shared , int_pending ) ;
OFFSET ( KVM_MAGIC_MSR , kvm_vcpu_arch_shared , msr ) ;
OFFSET ( KVM_MAGIC_CRITICAL , kvm_vcpu_arch_shared , critical ) ;
OFFSET ( KVM_MAGIC_SR , kvm_vcpu_arch_shared , sr ) ;
2010-07-29 16:47:57 +04:00
# endif
2008-12-11 04:55:41 +03:00
# ifdef CONFIG_44x
DEFINE ( PGD_T_LOG2 , PGD_T_LOG2 ) ;
DEFINE ( PTE_T_LOG2 , PTE_T_LOG2 ) ;
# endif
2009-10-17 03:48:40 +04:00
# ifdef CONFIG_PPC_FSL_BOOK3E
2010-05-13 23:38:21 +04:00
DEFINE ( TLBCAM_SIZE , sizeof ( struct tlbcam ) ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( TLBCAM_MAS0 , tlbcam , MAS0 ) ;
OFFSET ( TLBCAM_MAS1 , tlbcam , MAS1 ) ;
OFFSET ( TLBCAM_MAS2 , tlbcam , MAS2 ) ;
OFFSET ( TLBCAM_MAS3 , tlbcam , MAS3 ) ;
OFFSET ( TLBCAM_MAS7 , tlbcam , MAS7 ) ;
2010-05-13 23:38:21 +04:00
# endif
2008-04-17 08:28:09 +04:00
2011-06-15 03:34:31 +04:00
# if defined(CONFIG_KVM) && defined(CONFIG_SPE)
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_EVR , kvm_vcpu , arch . evr [ 0 ] ) ;
OFFSET ( VCPU_ACC , kvm_vcpu , arch . acc ) ;
OFFSET ( VCPU_SPEFSCR , kvm_vcpu , arch . spefscr ) ;
OFFSET ( VCPU_HOST_SPEFSCR , kvm_vcpu , arch . host_spefscr ) ;
2011-06-15 03:34:31 +04:00
# endif
2011-12-20 19:34:43 +04:00
# ifdef CONFIG_KVM_BOOKE_HV
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_HOST_MAS4 , kvm_vcpu , arch . host_mas4 ) ;
OFFSET ( VCPU_HOST_MAS6 , kvm_vcpu , arch . host_mas6 ) ;
2011-12-20 19:34:43 +04:00
# endif
2017-04-05 10:54:56 +03:00
# ifdef CONFIG_KVM_XICS
DEFINE ( VCPU_XIVE_SAVED_STATE , offsetof ( struct kvm_vcpu ,
arch . xive_saved_state ) ) ;
DEFINE ( VCPU_XIVE_CAM_WORD , offsetof ( struct kvm_vcpu ,
arch . xive_cam_word ) ) ;
DEFINE ( VCPU_XIVE_PUSHED , offsetof ( struct kvm_vcpu , arch . xive_pushed ) ) ;
2018-01-12 05:37:16 +03:00
DEFINE ( VCPU_XIVE_ESC_ON , offsetof ( struct kvm_vcpu , arch . xive_esc_on ) ) ;
DEFINE ( VCPU_XIVE_ESC_RADDR , offsetof ( struct kvm_vcpu , arch . xive_esc_raddr ) ) ;
DEFINE ( VCPU_XIVE_ESC_VADDR , offsetof ( struct kvm_vcpu , arch . xive_esc_vaddr ) ) ;
2017-04-05 10:54:56 +03:00
# endif
2008-12-03 00:51:57 +03:00
# ifdef CONFIG_KVM_EXIT_TIMING
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TIMING_EXIT_TBU , kvm_vcpu , arch . timing_exit . tv32 . tbu ) ;
OFFSET ( VCPU_TIMING_EXIT_TBL , kvm_vcpu , arch . timing_exit . tv32 . tbl ) ;
OFFSET ( VCPU_TIMING_LAST_ENTER_TBU , kvm_vcpu , arch . timing_last_enter . tv32 . tbu ) ;
OFFSET ( VCPU_TIMING_LAST_ENTER_TBL , kvm_vcpu , arch . timing_last_enter . tv32 . tbl ) ;
2008-12-03 00:51:57 +03:00
# endif
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 06:21:12 +03:00
DEFINE ( PPC_DBELL_SERVER , PPC_DBELL_SERVER ) ;
powerpc/8xx: Fix vaddr for IMMR early remap
Memory: 124428K/131072K available (3748K kernel code, 188K rwdata,
648K rodata, 508K init, 290K bss, 6644K reserved)
Kernel virtual memory layout:
* 0xfffdf000..0xfffff000 : fixmap
* 0xfde00000..0xfe000000 : consistent mem
* 0xfddf6000..0xfde00000 : early ioremap
* 0xc9000000..0xfddf6000 : vmalloc & ioremap
SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Today, IMMR is mapped 1:1 at startup.
Mapping IMMR 1:1 is just wrong because it may overlap with another
area. On most mpc8xx boards it is OK as IMMR is set to 0xff000000,
but on the EP88xC board, for instance, IMMR is at 0xfa200000, which
overlaps with the VM ioremap area.
This patch fixes the virtual address for remapping IMMR with the fixmap
regardless of the value of IMMR.
The size of the IMMR area is 256 kbytes (CPM at offset 0, security engine
at offset 128k), so a 512k page is enough.
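The overlap can be checked with simple range arithmetic. The sketch below uses the addresses quoted in the boot log above (vmalloc & ioremap at 0xc9000000..0xfddf6000) and the 256 kbyte IMMR size; the macro and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the commit message; names are illustrative only. */
#define DEMO_IMMR_SIZE      0x40000ull      /* 256 kbytes */
#define DEMO_VMALLOC_START  0xc9000000ull
#define DEMO_VMALLOC_END    0xfddf6000ull   /* exclusive upper bound */

/* True if a 1:1 mapping of IMMR would fall inside the vmalloc/ioremap area. */
static bool demo_immr_overlaps_vmalloc(uint32_t immr_base)
{
	uint64_t start = immr_base;
	uint64_t end = start + DEMO_IMMR_SIZE;  /* 64-bit to avoid wraparound */

	return start < DEMO_VMALLOC_END && end > DEMO_VMALLOC_START;
}
```

Under these assumptions, 0xff000000 lies above the vmalloc area and does not overlap, while 0xfa200000 falls inside it.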
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-05-17 10:02:43 +03:00
# ifdef CONFIG_PPC_8xx
2016-07-09 11:22:39 +03:00
DEFINE ( VIRT_IMMR_BASE , ( u64 ) __fix_to_virt ( FIX_IMMR_BASE ) ) ;
2016-05-17 10:02:43 +03:00
# endif
2020-05-06 06:40:23 +03:00
# ifdef CONFIG_XMON
DEFINE ( BPT_SIZE , BPT_SIZE ) ;
# endif
2005-09-26 10:04:21 +04:00
return 0 ;
}
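For readers unfamiliar with the OFFSET() entries above, here is a minimal sketch of the value each one records: the byte offset of a member within a struct, bound to a symbolic name. The structs and every DEMO_ name are hypothetical, and the enum is a stand-in; per the header comment of this file, the real macros place the value in the compiler's assembly output, from which the #defines are extracted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structures standing in for kvm_vcpu and its register block. */
struct demo_regs {
	uint64_t nip;
	uint64_t ctr;
	uint64_t link;
};

struct demo_vcpu {
	uint32_t id;
	struct demo_regs regs;
};

/*
 * Stand-in for OFFSET(sym, str, mem): binds sym to offsetof(struct str, mem).
 * An enum is used here so the constant is visible to C code directly.
 */
#define DEMO_OFFSET(sym, str, mem) enum { sym = offsetof(struct str, mem) }

DEMO_OFFSET(DEMO_VCPU_PC, demo_vcpu, regs.nip);
DEMO_OFFSET(DEMO_VCPU_CTR, demo_vcpu, regs.ctr);
DEMO_OFFSET(DEMO_VCPU_LR, demo_vcpu, regs.link);
```

Assembly code can then address a field as base register plus the named constant, without hard-coding a structure layout that may change between configurations.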