2005-09-26 10:04:21 +04:00
/*
 * This program is used to generate definitions needed by
 * assembly language modules.
 *
 * We use the technique used in the OSF Mach kernel code:
 * generate asm statements containing #defines,
 * compile this file to assembler, and then extract the
 * #defines from the assembly-language output.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
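
The header comment describes turning C structure layouts into `#define` lines that assembly sources can include. The kernel does this at compile time through the DEFINE and OFFSET macros from <linux/kbuild.h>. The sketch below is only a userspace analogue that formats the same kind of line at run time with `offsetof`. It is not the kernel mechanism, and `struct demo_thread`, `emit_define` and `DEMO_OFFSET` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a kernel structure; not the real layout. */
struct demo_thread {
	unsigned long ksp;
	unsigned long ksp_limit;
	int fpexc_mode;
};

/*
 * Format one "#define SYM value" line, the form an assembly source
 * would include. Returns what snprintf returns.
 */
static int emit_define(char *buf, size_t len, const char *sym, size_t val)
{
	assert(buf != NULL && sym != NULL);
	return snprintf(buf, len, "#define %s %zu", sym, val);
}

/*
 * Run-time analogue of OFFSET(sym, str, mem): stringify the symbol
 * name and pair it with the member's byte offset in the structure.
 */
#define DEMO_OFFSET(buf, len, sym, str, mem) \
	emit_define(buf, len, #sym, offsetof(struct str, mem))
```

Calling `DEMO_OFFSET(buf, sizeof buf, KSP, demo_thread, ksp)` fills `buf` with a line naming `KSP` and the offset of `ksp`. The real build gets the same result without running any code, by extracting marker lines from the compiler's assembly output.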
2018-03-14 07:03:25 +03:00
# include <linux/compat.h>
2005-09-26 10:04:21 +04:00
# include <linux/signal.h>
# include <linux/sched.h>
# include <linux/kernel.h>
# include <linux/errno.h>
# include <linux/string.h>
# include <linux/types.h>
# include <linux/mman.h>
# include <linux/mm.h>
2007-05-03 16:31:38 +04:00
# include <linux/suspend.h>
2008-02-05 08:16:48 +03:00
# include <linux/hrtimer.h>
2005-09-28 18:35:31 +04:00
# ifdef CONFIG_PPC64
2005-09-26 10:04:21 +04:00
# include <linux/time.h>
# include <linux/hardirq.h>
2005-09-28 18:35:31 +04:00
# endif
2008-04-29 12:04:08 +04:00
# include <linux/kbuild.h>
2005-09-28 18:35:31 +04:00
2005-09-26 10:04:21 +04:00
# include <asm/io.h>
# include <asm/page.h>
# include <asm/pgtable.h>
# include <asm/processor.h>
# include <asm/cputable.h>
# include <asm/thread_info.h>
2005-10-26 11:05:24 +04:00
# include <asm/rtas.h>
2005-11-11 13:15:21 +03:00
# include <asm/vdso_datapage.h>
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 06:21:12 +03:00
# include <asm/dbell.h>
2005-09-26 10:04:21 +04:00
# ifdef CONFIG_PPC64
# include <asm/paca.h>
# include <asm/lppaca.h>
# include <asm/cache.h>
2006-08-09 11:00:30 +04:00
# include <asm/mmu.h>
2006-09-13 22:32:39 +04:00
# include <asm/hvcall.h>
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 11:42:46 +04:00
# include <asm/xics.h>
2005-09-26 10:04:21 +04:00
# endif
2011-09-19 21:45:04 +04:00
# ifdef CONFIG_PPC_POWERNV
# include <asm/opal.h>
# endif
2010-08-30 14:01:56 +04:00
# if defined(CONFIG_KVM) || defined(CONFIG_KVM_GUEST)
2009-01-04 01:23:08 +03:00
# include <linux/kvm_host.h>
2010-04-16 02:11:44 +04:00
# endif
2010-08-30 14:01:56 +04:00
# if defined(CONFIG_KVM) && defined(CONFIG_PPC_BOOK3S)
# include <asm/kvm_book3s.h>
2014-04-24 15:46:24 +04:00
# include <asm/kvm_ppc.h>
2008-11-05 18:36:18 +03:00
# endif
2005-09-26 10:04:21 +04:00
2009-07-28 05:59:34 +04:00
# ifdef CONFIG_PPC32
2008-04-30 14:23:21 +04:00
# if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
# include "head_booke.h"
# endif
2009-07-28 05:59:34 +04:00
# endif
2008-04-30 14:23:21 +04:00
2009-10-17 03:48:40 +04:00
# if defined(CONFIG_PPC_FSL_BOOK3E)
2008-12-09 06:34:55 +03:00
# include "../mm/mmu_decl.h"
# endif
powerpc/8xx: Fix vaddr for IMMR early remap
Memory: 124428K/131072K available (3748K kernel code, 188K rwdata,
648K rodata, 508K init, 290K bss, 6644K reserved)
Kernel virtual memory layout:
* 0xfffdf000..0xfffff000 : fixmap
* 0xfde00000..0xfe000000 : consistent mem
* 0xfddf6000..0xfde00000 : early ioremap
* 0xc9000000..0xfddf6000 : vmalloc & ioremap
SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Today, IMMR is mapped 1:1 at startup.
Mapping IMMR 1:1 is just wrong because it may overlap with another
area. On most mpc8xx boards it is OK as IMMR is set to 0xff000000,
but for instance on the EP88xC board, IMMR is at 0xfa200000, which
overlaps with the VM ioremap area.
This patch fixes the virtual address for remapping IMMR with the fixmap
regardless of the value of IMMR.
The size of the IMMR area is 256 kbytes (CPM at offset 0, security engine
at offset 128k), so a 512k page is enough.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-05-17 10:02:43 +03:00
# ifdef CONFIG_PPC_8xx
# include <asm/fixmap.h>
# endif
2016-06-02 07:29:47 +03:00
# define STACK_PT_REGS_OFFSET(sym, val) \
DEFINE ( sym , STACK_FRAME_OVERHEAD + offsetof ( struct pt_regs , val ) )
2005-09-26 10:04:21 +04:00
int main ( void )
{
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD , task_struct , thread ) ;
OFFSET ( MM , task_struct , mm ) ;
2018-09-27 10:05:53 +03:00
# ifdef CONFIG_STACKPROTECTOR
OFFSET ( TASK_CANARY , task_struct , stack_canary ) ;
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( MMCONTEXTID , mm_struct , context . id ) ;
2005-09-26 10:04:21 +04:00
# ifdef CONFIG_PPC64
powerpc: Allow perf_counters to access user memory at interrupt time
This provides a mechanism to allow the perf_counters code to access
user memory in a PMU interrupt routine. Such an access can cause
various kinds of interrupt: SLB miss, MMU hash table miss, segment
table miss, or TLB miss, depending on the processor. This commit
only deals with 64-bit classic/server processors, which use an MMU
hash table. 32-bit processors are already able to access user memory
at interrupt time. Since we don't soft-disable on 32-bit, we avoid
the possibility of reentering hash_page or the TLB miss handlers,
since they run with interrupts disabled.
On 64-bit processors, an SLB miss interrupt on a user address will
update the slb_cache and slb_cache_ptr fields in the paca. This is
OK except in the case where a PMU interrupt occurs in switch_slb,
which also accesses those fields. To prevent this, we hard-disable
interrupts in switch_slb. Interrupts are already soft-disabled at
this point, and will get hard-enabled when they get soft-enabled
later.
This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice,
and to make sure that it clears the slb_cache_ptr when called from
other callers than switch_slb, the existing routine is renamed to
__slb_flush_and_rebolt, which is called by switch_slb and the new
version of slb_flush_and_rebolt.
Similarly, switch_stab (used on POWER3 and RS64 processors) gets a
hard_irq_disable() to protect the per-cpu variables used there and
in ste_allocate.
If an MMU hashtable miss interrupt occurs, normally we would call
hash_page to look up the Linux PTE for the address and create a HPTE.
However, hash_page is fairly complex and takes some locks, so to
avoid the possibility of deadlock, we check the preemption count
to see if we are in a (pseudo-)NMI handler, and if so, we don't call
hash_page but instead treat it like a bad access that will get
reported up through the exception table mechanism. An interrupt
whose handler runs even though the interrupt occurred when
soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI
handler, which should use nmi_enter()/nmi_exit() rather than
irq_enter()/irq_exit().
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-17 09:17:54 +04:00
DEFINE ( SIGSEGV , SIGSEGV ) ;
DEFINE ( NMI_MASK , NMI_MASK ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( TASKTHREADPPR , task_struct , thread . ppr ) ;
2005-09-28 18:35:31 +04:00
# else
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_INFO , task_struct , stack ) ;
2013-09-24 09:17:21 +04:00
DEFINE ( THREAD_INFO_GAP , _ALIGN_UP ( sizeof ( struct thread_info ) , 16 ) ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( KSP_LIMIT , thread_struct , ksp_limit ) ;
2005-09-28 18:35:31 +04:00
# endif /* CONFIG_PPC64 */
powerpc/livepatch: Add live patching support on ppc64le
Add the kconfig logic & assembly support for handling live patched
functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn
depends on the new -mprofile-kernel ftrace ABI, which is only supported
currently on ppc64le.
Live patching is handled by a special ftrace handler. This means it runs
from ftrace_caller(). The live patch handler modifies the NIP so as to
redirect the return from ftrace_caller() to the new patched function.
However there is one particularly tricky case we need to handle.
If a function A calls another function B, and it is known at link time
that they share the same TOC, then A will not save or restore its TOC,
and will call the local entry point of B.
When we live patch B, we replace it with a new function C, which may
not have the same TOC as A. At live patch time it's too late to modify A
to do the TOC save/restore, so the live patching code must interpose
itself between A and C, and do the TOC save/restore that A omitted.
An additional complication is that the livepatch code cannot create a
stack frame in order to save the TOC. That is because if C takes > 8
arguments, or is varargs, A will have written the arguments for C in
A's stack frame.
To solve this, we introduce a "livepatch stack" which grows upward from
the base of the regular stack, and is used to store the TOC & LR when
calling a live patched function.
When the patched function returns, we retrieve the real LR & TOC from
the livepatch stack, restore them, and pop the livepatch "stack frame".
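
The push and pop behaviour described above can be sketched in plain C. This is a simplified model under stated assumptions, not the kernel's assembly: `struct demo_task_stack`, `demo_lp_push` and `demo_lp_pop` are hypothetical names, the stack is modelled as an array of words, and the only overflow check is against the array bound.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STACK_WORDS 64	/* hypothetical stack size, in words */

struct demo_task_stack {
	unsigned long words[DEMO_STACK_WORDS];	/* words[0] is the stack base */
	size_t livepatch_sp;	/* next free slot; grows upward from the base */
};

static void demo_lp_init(struct demo_task_stack *s)
{
	s->livepatch_sp = 0;
}

/* Save the caller's TOC and LR before entering a patched function. */
static int demo_lp_push(struct demo_task_stack *s,
			unsigned long toc, unsigned long lr)
{
	if (s->livepatch_sp + 2 > DEMO_STACK_WORDS)
		return -1;	/* no room left in this model */
	s->words[s->livepatch_sp++] = toc;
	s->words[s->livepatch_sp++] = lr;
	return 0;
}

/* Retrieve the saved LR and TOC when the patched function returns. */
static int demo_lp_pop(struct demo_task_stack *s,
		       unsigned long *toc, unsigned long *lr)
{
	if (s->livepatch_sp < 2)
		return -1;	/* nothing was pushed */
	*lr = s->words[--s->livepatch_sp];
	*toc = s->words[--s->livepatch_sp];
	return 0;
}
```

Because the saved values live at the low end of the stack area, the regular stack frame of A, which may hold arguments for C, is left untouched.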
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
2016-03-24 14:04:05 +03:00
# ifdef CONFIG_LIVEPATCH
2017-02-15 13:41:20 +03:00
OFFSET ( TI_livepatch_sp , thread_info , livepatch_sp ) ;
2016-03-24 14:04:05 +03:00
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( KSP , thread_struct , ksp ) ;
OFFSET ( PT_REGS , thread_struct , regs ) ;
2011-04-23 01:48:27 +04:00
# ifdef CONFIG_BOOKE
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_NORMSAVES , thread_struct , normsave [ 0 ] ) ;
2011-04-23 01:48:27 +04:00
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_FPEXC_MODE , thread_struct , fpexc_mode ) ;
2017-05-08 09:23:31 +03:00
OFFSET ( THREAD_FPSTATE , thread_struct , fp_state . fpr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_FPSAVEAREA , thread_struct , fp_save_area ) ;
OFFSET ( FPSTATE_FPSCR , thread_fp_state , fpscr ) ;
OFFSET ( THREAD_LOAD_FP , thread_struct , load_fp ) ;
2005-09-26 10:04:21 +04:00
# ifdef CONFIG_ALTIVEC
2017-05-08 09:23:31 +03:00
OFFSET ( THREAD_VRSTATE , thread_struct , vr_state . vr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_VRSAVEAREA , thread_struct , vr_save_area ) ;
OFFSET ( THREAD_VRSAVE , thread_struct , vrsave ) ;
OFFSET ( THREAD_USED_VR , thread_struct , used_vr ) ;
OFFSET ( VRSTATE_VSCR , thread_vr_state , vscr ) ;
OFFSET ( THREAD_LOAD_VEC , thread_struct , load_vec ) ;
2005-09-26 10:04:21 +04:00
# endif /* CONFIG_ALTIVEC */
2008-06-25 08:07:18 +04:00
# ifdef CONFIG_VSX
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_USED_VSR , thread_struct , used_vsr ) ;
2008-06-25 08:07:18 +04:00
# endif /* CONFIG_VSX */
2005-09-28 18:35:31 +04:00
# ifdef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
OFFSET ( KSP_VSID , thread_struct , ksp_vsid ) ;
2005-09-28 18:35:31 +04:00
# else /* CONFIG_PPC64 */
2017-02-15 13:41:20 +03:00
OFFSET ( PGDIR , thread_struct , pgdir ) ;
2005-09-26 10:04:21 +04:00
# ifdef CONFIG_SPE
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_EVR0 , thread_struct , evr [ 0 ] ) ;
OFFSET ( THREAD_ACC , thread_struct , acc ) ;
OFFSET ( THREAD_SPEFSCR , thread_struct , spefscr ) ;
OFFSET ( THREAD_USED_SPE , thread_struct , used_spe ) ;
2005-09-26 10:04:21 +04:00
# endif /* CONFIG_SPE */
2005-09-28 18:35:31 +04:00
# endif /* CONFIG_PPC64 */
2013-05-22 08:20:59 +04:00
# if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_DBCR0 , thread_struct , debug . dbcr0 ) ;
2013-05-22 08:20:59 +04:00
# endif
2010-04-16 02:11:51 +04:00
# ifdef CONFIG_KVM_BOOK3S_32_HANDLER
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_KVM_SVCPU , thread_struct , kvm_shadow_vcpu ) ;
2010-04-16 02:11:51 +04:00
# endif
2013-01-16 02:20:42 +04:00
# if defined(CONFIG_KVM) && defined(CONFIG_BOOKE)
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_KVM_VCPU , thread_struct , kvm_vcpu ) ;
2011-12-20 19:34:43 +04:00
# endif
2005-09-28 18:35:31 +04:00
2013-02-13 20:21:32 +04:00
# ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2017-02-15 13:41:20 +03:00
OFFSET ( PACATMSCRATCH , paca_struct , tm_scratch ) ;
OFFSET ( THREAD_TM_TFHAR , thread_struct , tm_tfhar ) ;
OFFSET ( THREAD_TM_TEXASR , thread_struct , tm_texasr ) ;
OFFSET ( THREAD_TM_TFIAR , thread_struct , tm_tfiar ) ;
OFFSET ( THREAD_TM_TAR , thread_struct , tm_tar ) ;
OFFSET ( THREAD_TM_PPR , thread_struct , tm_ppr ) ;
OFFSET ( THREAD_TM_DSCR , thread_struct , tm_dscr ) ;
OFFSET ( PT_CKPT_REGS , thread_struct , ckpt_regs ) ;
2017-05-08 09:23:31 +03:00
OFFSET ( THREAD_CKVRSTATE , thread_struct , ckvr_state . vr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( THREAD_CKVRSAVE , thread_struct , ckvrsave ) ;
2017-05-08 09:23:31 +03:00
OFFSET ( THREAD_CKFPSTATE , thread_struct , ckfp_state . fpr ) ;
2013-02-13 20:21:32 +04:00
/* Local pt_regs on stack for Transactional Memory funcs. */
DEFINE ( TM_FRAME_SIZE , STACK_FRAME_OVERHEAD +
sizeof ( struct pt_regs ) + 16 ) ;
# endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
2013-02-07 19:46:58 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( TI_FLAGS , thread_info , flags ) ;
OFFSET ( TI_LOCAL_FLAGS , thread_info , local_flags ) ;
OFFSET ( TI_PREEMPT , thread_info , preempt_count ) ;
OFFSET ( TI_TASK , thread_info , task ) ;
OFFSET ( TI_CPU , thread_info , cpu ) ;
2005-09-28 18:35:31 +04:00
# ifdef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
OFFSET ( DCACHEL1BLOCKSIZE , ppc64_caches , l1d . block_size ) ;
OFFSET ( DCACHEL1LOGBLOCKSIZE , ppc64_caches , l1d . log_block_size ) ;
OFFSET ( DCACHEL1BLOCKSPERPAGE , ppc64_caches , l1d . blocks_per_page ) ;
OFFSET ( ICACHEL1BLOCKSIZE , ppc64_caches , l1i . block_size ) ;
OFFSET ( ICACHEL1LOGBLOCKSIZE , ppc64_caches , l1i . log_block_size ) ;
OFFSET ( ICACHEL1BLOCKSPERPAGE , ppc64_caches , l1i . blocks_per_page ) ;
2005-09-28 18:35:31 +04:00
/* paca */
DEFINE ( PACA_SIZE , sizeof ( struct paca_struct ) ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( PACAPACAINDEX , paca_struct , paca_index ) ;
OFFSET ( PACAPROCSTART , paca_struct , cpu_start ) ;
OFFSET ( PACAKSAVE , paca_struct , kstack ) ;
OFFSET ( PACACURRENT , paca_struct , __current ) ;
OFFSET ( PACASAVEDMSR , paca_struct , saved_msr ) ;
2018-10-02 16:56:39 +03:00
OFFSET ( PACASTABRR , paca_struct , stab_rr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( PACAR1 , paca_struct , saved_r1 ) ;
OFFSET ( PACATOC , paca_struct , kernel_toc ) ;
OFFSET ( PACAKBASE , paca_struct , kernelbase ) ;
OFFSET ( PACAKMSR , paca_struct , kernel_msr ) ;
2017-12-20 06:55:50 +03:00
OFFSET ( PACAIRQSOFTMASK , paca_struct , irq_soft_mask ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( PACAIRQHAPPENED , paca_struct , irq_happened ) ;
2018-04-19 10:04:00 +03:00
OFFSET ( PACA_FTRACE_ENABLED , paca_struct , ftrace_enabled ) ;
2018-10-02 16:56:39 +03:00
# ifdef CONFIG_PPC_BOOK3S
OFFSET ( PACACONTEXTID , paca_struct , mm_ctx_id ) ;
# ifdef CONFIG_PPC_MM_SLICES
OFFSET ( PACALOWSLICESPSIZE , paca_struct , mm_ctx_low_slices_psize ) ;
OFFSET ( PACAHIGHSLICEPSIZE , paca_struct , mm_ctx_high_slices_psize ) ;
OFFSET ( PACA_SLB_ADDR_LIMIT , paca_struct , mm_ctx_slb_addr_limit ) ;
DEFINE ( MMUPSIZEDEFSIZE , sizeof ( struct mmu_psize_def ) ) ;
# endif /* CONFIG_PPC_MM_SLICES */
# endif
2009-07-24 03:15:42 +04:00
# ifdef CONFIG_PPC_BOOK3E
2017-02-15 13:41:20 +03:00
OFFSET ( PACAPGD , paca_struct , pgd ) ;
OFFSET ( PACA_KERNELPGD , paca_struct , kernel_pgd ) ;
OFFSET ( PACA_EXGEN , paca_struct , exgen ) ;
OFFSET ( PACA_EXTLB , paca_struct , extlb ) ;
OFFSET ( PACA_EXMC , paca_struct , exmc ) ;
OFFSET ( PACA_EXCRIT , paca_struct , excrit ) ;
OFFSET ( PACA_EXDBG , paca_struct , exdbg ) ;
OFFSET ( PACA_MC_STACK , paca_struct , mc_kstack ) ;
OFFSET ( PACA_CRIT_STACK , paca_struct , crit_kstack ) ;
OFFSET ( PACA_DBG_STACK , paca_struct , dbg_kstack ) ;
OFFSET ( PACA_TCD_PTR , paca_struct , tcd_ptr ) ;
OFFSET ( TCD_ESEL_NEXT , tlb_core_data , esel_next ) ;
OFFSET ( TCD_ESEL_MAX , tlb_core_data , esel_max ) ;
OFFSET ( TCD_ESEL_FIRST , tlb_core_data , esel_first ) ;
2009-07-24 03:15:42 +04:00
# endif /* CONFIG_PPC_BOOK3E */
2017-10-19 07:08:43 +03:00
# ifdef CONFIG_PPC_BOOK3S_64
2017-02-15 13:41:20 +03:00
OFFSET ( PACASLBCACHE , paca_struct , slb_cache ) ;
OFFSET ( PACASLBCACHEPTR , paca_struct , slb_cache_ptr ) ;
OFFSET ( PACAVMALLOCSLLP , paca_struct , vmalloc_sllp ) ;
2009-06-03 01:17:41 +04:00
# ifdef CONFIG_PPC_MM_SLICES
2017-02-15 13:41:20 +03:00
OFFSET ( MMUPSIZESLLP , mmu_psize_def , sllp ) ;
2007-05-08 10:27:27 +04:00
# else
2017-02-15 13:41:20 +03:00
OFFSET ( PACACONTEXTSLLP , paca_struct , mm_ctx_sllp ) ;
2007-05-08 10:27:27 +04:00
# endif /* CONFIG_PPC_MM_SLICES */
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_EXGEN , paca_struct , exgen ) ;
OFFSET ( PACA_EXMC , paca_struct , exmc ) ;
OFFSET ( PACA_EXSLB , paca_struct , exslb ) ;
2016-12-19 21:30:04 +03:00
OFFSET ( PACA_EXNMI , paca_struct , exnmi ) ;
2018-02-13 18:08:11 +03:00
# ifdef CONFIG_PPC_PSERIES
2017-02-15 13:41:20 +03:00
OFFSET ( PACALPPACAPTR , paca_struct , lppaca_ptr ) ;
2018-02-13 18:08:11 +03:00
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_SLBSHADOWPTR , paca_struct , slb_shadow_ptr ) ;
OFFSET ( SLBSHADOW_STACKVSID , slb_shadow , save_area [ SLB_NUM_BOLTED - 1 ] . vsid ) ;
OFFSET ( SLBSHADOW_STACKESID , slb_shadow , save_area [ SLB_NUM_BOLTED - 1 ] . esid ) ;
OFFSET ( SLBSHADOW_SAVEAREA , slb_shadow , save_area ) ;
OFFSET ( LPPACA_PMCINUSE , lppaca , pmcregs_in_use ) ;
2018-02-13 18:08:11 +03:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET ( PACA_PMCINUSE , paca_struct , pmcregs_in_use ) ;
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( LPPACA_DTLIDX , lppaca , dtl_idx ) ;
OFFSET ( LPPACA_YIELDCOUNT , lppaca , yield_count ) ;
OFFSET ( PACA_DTL_RIDX , paca_struct , dtl_ridx ) ;
2017-10-19 07:08:43 +03:00
# endif /* CONFIG_PPC_BOOK3S_64 */
2017-02-15 13:41:20 +03:00
OFFSET ( PACAEMERGSP , paca_struct , emergency_sp ) ;
powerpc/book3s: handle machine check in Linux host.
Move machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and hand over the high level info to OS.
This patch introduces an early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
a stack frame on the emergency stack and set r1 accordingly. This allows us to be
prepared to take another exception without losing context. One thing to note
here is that, if we get another machine check while the ME bit is off, we risk a
checkstop. Hence we restrict ourselves to saving only the MCE information and
the registers saved in the PACA_EXMC save area before we turn the ME bit on. We use
the paca->in_mce flag to differentiate between first entry and nested machine check
entry, which helps proper use of the emergency stack. We increment paca->in_mce
every time we enter the early machine check handler and decrement it while
leaving. When we enter the early machine check handler the first time (paca->in_mce ==
0), we are sure nobody is using the MC emergency stack and allocate a stack frame
at the start of the emergency stack. During subsequent entry (paca->in_mce >
0), we know that r1 points inside the emergency stack and we allocate a separate
stack frame accordingly. This prevents us from clobbering MCE information
during nested machine checks.
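
The frame selection keyed on paca->in_mce can be sketched in C as follows. This is a hedged model only: the real handler is assembly, `struct demo_paca`, `demo_mce_enter`, `demo_mce_exit` and `DEMO_FRAME_SIZE` are hypothetical, and the sketch assumes a stack that grows downward, so that the "start" of the emergency stack is its highest address.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FRAME_SIZE 512u	/* hypothetical frame size in bytes */

struct demo_paca {
	uintptr_t mc_emergency_sp;	/* start (top) of the MC emergency stack */
	unsigned int in_mce;		/* nesting depth of machine checks */
};

/*
 * Pick the stack pointer for a new machine check frame and record the
 * entry. First entry starts from the emergency stack; a nested entry
 * continues below the current r1, which is already on that stack.
 */
static uintptr_t demo_mce_enter(struct demo_paca *paca, uintptr_t r1)
{
	uintptr_t base = (paca->in_mce == 0) ? paca->mc_emergency_sp : r1;

	paca->in_mce++;
	return base - DEMO_FRAME_SIZE;
}

/* Leave the early handler: undo the nesting count. */
static void demo_mce_exit(struct demo_paca *paca)
{
	assert(paca->in_mce > 0);
	paca->in_mce--;
}
```

Each nested entry gets its own frame below the previous one, so the saved srr0, srr1, dar and dsisr of the outer machine check are not overwritten.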
The early machine check handler changes are placed under CPU_FTR_HVMODE
section. This makes sure that the early machine check handler will get executed
only in hypervisor kernel.
This is the code flow:
Machine Check Interrupt
|
V
0x200 vector ME=0, IR=0, DR=0
|
V
+-----------------------------------------------+
|machine_check_pSeries_early: | ME=0, IR=0, DR=0
| Alloc frame on emergency stack |
| Save srr1, srr0, dar and dsisr on stack |
+-----------------------------------------------+
|
(ME=1, IR=0, DR=0, RFID)
|
V
machine_check_handle_early ME=1, IR=0, DR=0
|
V
+-----------------------------------------------+
| machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0
| Things to do: (in next patches) |
| Flush SLB for SLB errors |
| Flush TLB for TLB errors |
| Decode and save MCE info |
+-----------------------------------------------+
|
(Fall through existing exception handler routine.)
|
V
machine_check_pSeries ME=1, IR=0, DR=0
|
(ME=1, IR=1, DR=1, RFID)
|
V
machine_check_common ME=1, IR=1, DR=1
.
.
.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 18:34:08 +04:00
# ifdef CONFIG_PPC_BOOK3S_64
2017-02-15 13:41:20 +03:00
OFFSET ( PACAMCEMERGSP , paca_struct , mc_emergency_sp ) ;
2016-12-19 21:30:06 +03:00
OFFSET ( PACA_NMI_EMERG_SP , paca_struct , nmi_emergency_sp ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_IN_MCE , paca_struct , in_mce ) ;
2016-12-19 21:30:05 +03:00
OFFSET ( PACA_IN_NMI , paca_struct , in_nmi ) ;
powerpc/64s: Add support for RFI flush of L1-D cache
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports, testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-09 19:07:15 +03:00
OFFSET ( PACA_RFI_FLUSH_FALLBACK_AREA , paca_struct , rfi_flush_fallback_area ) ;
OFFSET ( PACA_EXRFI , paca_struct , exrfi ) ;
2018-01-17 16:58:18 +03:00
OFFSET ( PACA_L1D_FLUSH_SIZE , paca_struct , l1d_flush_size ) ;
2018-01-09 19:07:15 +03:00
2017-02-15 13:41:20 +03:00
# endif
OFFSET ( PACAHWCPUID , paca_struct , hw_cpu_id ) ;
OFFSET ( PACAKEXECSTATE , paca_struct , kexec_state ) ;
OFFSET ( PACA_DSCR_DEFAULT , paca_struct , dscr_default ) ;
OFFSET ( ACCOUNT_STARTTIME , paca_struct , accounting . starttime ) ;
OFFSET ( ACCOUNT_STARTTIME_USER , paca_struct , accounting . starttime_user ) ;
powerpc updates for 4.11 part 2
Highlights include:
- An update of the disassembly code used by xmon to the latest versions in
binutils. We've received permission from all the authors of the relevant
binutils changes to relicense their changes to the relevant files from GPLv3
to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg
work to get permission from everyone.
- Addition of the "architected" Power9 CPU table entry, allowing us to boot
in Power9 architected mode under a hypervisor.
- Updates to the Power9 PMU code.
- Implementation of clear_bit_unlock_is_negative_byte() to optimise
unlock_page().
- Freescale updates from Scott: "Highlights include 8xx breakpoints and perf,
t1042rdb display support, and board updates."
Thanks to:
Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller,
Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney,
Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith.
Merge tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"Highlights include:
- an update of the disassembly code used by xmon to the latest
versions in binutils. We've received permission from all the
authors of the relevant binutils changes to relicense their changes
to the relevant files from GPLv3 to GPLv2, for inclusion in Linux.
Thanks to Peter Bergner for doing the leg work to get permission
from everyone.
- addition of the "architected" Power9 CPU table entry, allowing us
to boot in Power9 architected mode under a hypervisor.
- updates to the Power9 PMU code.
- implementation of clear_bit_unlock_is_negative_byte() to optimise
unlock_page().
- Freescale updates from Scott: "Highlights include 8xx breakpoints
and perf, t1042rdb display support, and board updates."
Thanks to:
Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas
Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan,
Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter
Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil
Mehta, Stewart Smith"
* tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
powerpc: Remove leftover cputime_to_nsecs call causing build error
powerpc/mm/hash: Always clear UPRT and Host Radix bits when setting up CPU
powerpc/optprobes: Fix TOC handling in optprobes trampoline
powerpc/pseries: Advertise Hot Plug Event support to firmware
cxl: fix nested locking hang during EEH hotplug
powerpc/xmon: Dump memory in CPU endian format
powerpc/pseries: Revert 'Auto-online hotplugged memory'
powerpc/powernv: Make PCI non-optional
powerpc/64: Implement clear_bit_unlock_is_negative_byte()
powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable()
powerpc/kernel: Remove error message in pcibios_setup_phb_resources()
powerpc/mm: Fix typo in set_pte_at()
pci/hotplug/pnv-php: Disable MSI and PCI device properly
pci/hotplug/pnv-php: Disable surprise hotplug capability on conflicts
pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot()
powerpc: Add POWER9 architected mode to cputable
powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags()
powerpc/perf: Avoid FAB_*_MATCH checks for power9
powerpc/perf: Add restrictions to PMC5 in power9 DD1
powerpc/perf: Use Instruction Counter value
...
2017-03-01 21:10:16 +03:00
OFFSET ( ACCOUNT_USER_TIME , paca_struct , accounting . utime ) ;
OFFSET ( ACCOUNT_SYSTEM_TIME , paca_struct , accounting . stime ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_TRAP_SAVE , paca_struct , trap_save ) ;
OFFSET ( PACA_NAPSTATELOST , paca_struct , nap_state_lost ) ;
OFFSET ( PACA_SPRG_VDSO , paca_struct , sprg_vdso ) ;
2016-05-17 09:33:46 +03:00
# else /* CONFIG_PPC64 */
# ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
2017-02-15 13:41:20 +03:00
OFFSET ( ACCOUNT_STARTTIME , thread_info , accounting . starttime ) ;
OFFSET ( ACCOUNT_STARTTIME_USER , thread_info , accounting . starttime_user ) ;
2017-03-01 21:10:16 +03:00
OFFSET ( ACCOUNT_USER_TIME , thread_info , accounting . utime ) ;
OFFSET ( ACCOUNT_SYSTEM_TIME , thread_info , accounting . stime ) ;
2016-05-17 09:33:46 +03:00
# endif
2005-10-26 11:05:24 +04:00
# endif /* CONFIG_PPC64 */
2005-09-28 18:35:31 +04:00
/* RTAS */
2017-02-15 13:41:20 +03:00
OFFSET ( RTASBASE , rtas_t , base ) ;
OFFSET ( RTASENTRY , rtas_t , entry ) ;
2005-09-28 18:35:31 +04:00
2005-09-26 10:04:21 +04:00
/* Interrupt register frame */
2008-04-24 00:33:49 +04:00
DEFINE ( INT_FRAME_SIZE , STACK_INT_FRAME_SIZE ) ;
2005-09-26 10:04:21 +04:00
DEFINE ( SWITCH_FRAME_SIZE , STACK_FRAME_OVERHEAD + sizeof ( struct pt_regs ) ) ;
2010-04-16 02:11:55 +04:00
# ifdef CONFIG_PPC64
2005-09-28 18:35:31 +04:00
/* Create extra stack space for SRR0 and SRR1 when calling prom/rtas. */
DEFINE ( PROM_FRAME_SIZE , STACK_FRAME_OVERHEAD + sizeof ( struct pt_regs ) + 16 ) ;
DEFINE ( RTAS_FRAME_SIZE , STACK_FRAME_OVERHEAD + sizeof ( struct pt_regs ) + 16 ) ;
# endif /* CONFIG_PPC64 */
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( GPR0 , gpr [ 0 ] ) ;
STACK_PT_REGS_OFFSET ( GPR1 , gpr [ 1 ] ) ;
STACK_PT_REGS_OFFSET ( GPR2 , gpr [ 2 ] ) ;
STACK_PT_REGS_OFFSET ( GPR3 , gpr [ 3 ] ) ;
STACK_PT_REGS_OFFSET ( GPR4 , gpr [ 4 ] ) ;
STACK_PT_REGS_OFFSET ( GPR5 , gpr [ 5 ] ) ;
STACK_PT_REGS_OFFSET ( GPR6 , gpr [ 6 ] ) ;
STACK_PT_REGS_OFFSET ( GPR7 , gpr [ 7 ] ) ;
STACK_PT_REGS_OFFSET ( GPR8 , gpr [ 8 ] ) ;
STACK_PT_REGS_OFFSET ( GPR9 , gpr [ 9 ] ) ;
STACK_PT_REGS_OFFSET ( GPR10 , gpr [ 10 ] ) ;
STACK_PT_REGS_OFFSET ( GPR11 , gpr [ 11 ] ) ;
STACK_PT_REGS_OFFSET ( GPR12 , gpr [ 12 ] ) ;
STACK_PT_REGS_OFFSET ( GPR13 , gpr [ 13 ] ) ;
2005-09-28 18:35:31 +04:00
# ifndef CONFIG_PPC64
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( GPR14 , gpr [ 14 ] ) ;
2005-09-28 18:35:31 +04:00
# endif /* CONFIG_PPC64 */
2005-09-26 10:04:21 +04:00
/*
* Note: these symbols include _ because they overlap with special
* register names
*/
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( _NIP , nip ) ;
STACK_PT_REGS_OFFSET ( _MSR , msr ) ;
STACK_PT_REGS_OFFSET ( _CTR , ctr ) ;
STACK_PT_REGS_OFFSET ( _LINK , link ) ;
STACK_PT_REGS_OFFSET ( _CCR , ccr ) ;
STACK_PT_REGS_OFFSET ( _XER , xer ) ;
STACK_PT_REGS_OFFSET ( _DAR , dar ) ;
STACK_PT_REGS_OFFSET ( _DSISR , dsisr ) ;
STACK_PT_REGS_OFFSET ( ORIG_GPR3 , orig_gpr3 ) ;
STACK_PT_REGS_OFFSET ( RESULT , result ) ;
STACK_PT_REGS_OFFSET ( _TRAP , trap ) ;
2005-09-28 18:35:31 +04:00
# ifndef CONFIG_PPC64
/*
* The PowerPC 400-class & Book-E processors have neither the DAR
* nor the DSISR SPRs. Hence, we overload them to hold the similar
* DEAR and ESR SPRs for such processors. For critical interrupts
* we use them to hold SRR0 and SRR1.
2005-09-26 10:04:21 +04:00
*/
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( _DEAR , dar ) ;
STACK_PT_REGS_OFFSET ( _ESR , dsisr ) ;
2005-09-28 18:35:31 +04:00
# else /* CONFIG_PPC64 */
2016-06-02 07:29:47 +03:00
STACK_PT_REGS_OFFSET ( SOFTE , softe ) ;
2005-09-28 18:35:31 +04:00
/* These _only_ to be used with {PROM,RTAS}_FRAME_SIZE!!! */
DEFINE ( _SRR0 , STACK_FRAME_OVERHEAD + sizeof ( struct pt_regs ) ) ;
DEFINE ( _SRR1 , STACK_FRAME_OVERHEAD + sizeof ( struct pt_regs ) + 8 ) ;
# endif /* CONFIG_PPC64 */
2009-07-28 05:59:34 +04:00
# if defined(CONFIG_PPC32)
2008-04-30 14:23:21 +04:00
# if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
DEFINE ( EXC_LVL_SIZE , STACK_EXC_LVL_FRAME_SIZE ) ;
DEFINE ( MAS0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas0 ) ) ;
/* we overload MMUCR for 44x on MAS0 since they are mutually exclusive */
DEFINE ( MMUCR , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas0 ) ) ;
DEFINE ( MAS1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas1 ) ) ;
DEFINE ( MAS2 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas2 ) ) ;
DEFINE ( MAS3 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas3 ) ) ;
DEFINE ( MAS6 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas6 ) ) ;
DEFINE ( MAS7 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , mas7 ) ) ;
DEFINE ( _SRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , srr0 ) ) ;
DEFINE ( _SRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , srr1 ) ) ;
DEFINE ( _CSRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , csrr0 ) ) ;
DEFINE ( _CSRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , csrr1 ) ) ;
DEFINE ( _DSRR0 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , dsrr0 ) ) ;
DEFINE ( _DSRR1 , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , dsrr1 ) ) ;
DEFINE ( SAVED_KSP_LIMIT , STACK_INT_FRAME_SIZE + offsetof ( struct exception_regs , saved_ksp_limit ) ) ;
# endif
2009-07-28 05:59:34 +04:00
# endif
2005-09-28 18:35:31 +04:00
# ifndef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
OFFSET ( MM_PGD , mm_struct , pgd ) ;
2005-09-28 18:35:31 +04:00
# endif /* ! CONFIG_PPC64 */
2005-09-26 10:04:21 +04:00
/* About the CPU features table */
2017-02-15 13:41:20 +03:00
OFFSET ( CPU_SPEC_FEATURES , cpu_spec , cpu_features ) ;
OFFSET ( CPU_SPEC_SETUP , cpu_spec , cpu_setup ) ;
OFFSET ( CPU_SPEC_RESTORE , cpu_spec , cpu_restore ) ;
2005-09-26 10:04:21 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( pbe_address , pbe , address ) ;
OFFSET ( pbe_orig_address , pbe , orig_address ) ;
OFFSET ( pbe_next , pbe , next ) ;
2005-09-26 10:04:21 +04:00
2007-05-03 16:31:38 +04:00
# ifndef CONFIG_PPC64
2005-10-11 16:08:12 +04:00
DEFINE ( TASK_SIZE , TASK_SIZE ) ;
2005-09-28 18:35:31 +04:00
DEFINE ( NUM_USER_SEGMENTS , TASK_SIZE >> 28 ) ;
2005-11-11 13:15:21 +03:00
# endif /* ! CONFIG_PPC64 */
2005-09-26 10:04:21 +04:00
2005-11-11 13:15:21 +03:00
/* datapage offsets for use by vdso */
2017-02-15 13:41:20 +03:00
OFFSET ( CFG_TB_ORIG_STAMP , vdso_data , tb_orig_stamp ) ;
OFFSET ( CFG_TB_TICKS_PER_SEC , vdso_data , tb_ticks_per_sec ) ;
OFFSET ( CFG_TB_TO_XS , vdso_data , tb_to_xs ) ;
OFFSET ( CFG_TB_UPDATE_COUNT , vdso_data , tb_update_count ) ;
OFFSET ( CFG_TZ_MINUTEWEST , vdso_data , tz_minuteswest ) ;
OFFSET ( CFG_TZ_DSTTIME , vdso_data , tz_dsttime ) ;
OFFSET ( CFG_SYSCALL_MAP32 , vdso_data , syscall_map_32 ) ;
OFFSET ( WTOM_CLOCK_SEC , vdso_data , wtom_clock_sec ) ;
OFFSET ( WTOM_CLOCK_NSEC , vdso_data , wtom_clock_nsec ) ;
OFFSET ( STAMP_XTIME , vdso_data , stamp_xtime ) ;
OFFSET ( STAMP_SEC_FRAC , vdso_data , stamp_sec_fraction ) ;
OFFSET ( CFG_ICACHE_BLOCKSZ , vdso_data , icache_block_size ) ;
OFFSET ( CFG_DCACHE_BLOCKSZ , vdso_data , dcache_block_size ) ;
OFFSET ( CFG_ICACHE_LOGBLOCKSZ , vdso_data , icache_log_block_size ) ;
OFFSET ( CFG_DCACHE_LOGBLOCKSZ , vdso_data , dcache_log_block_size ) ;
2005-11-11 13:15:21 +03:00
# ifdef CONFIG_PPC64
2017-02-15 13:41:20 +03:00
OFFSET ( CFG_SYSCALL_MAP64 , vdso_data , syscall_map_64 ) ;
OFFSET ( TVAL64_TV_SEC , timeval , tv_sec ) ;
OFFSET ( TVAL64_TV_USEC , timeval , tv_usec ) ;
OFFSET ( TVAL32_TV_SEC , compat_timeval , tv_sec ) ;
OFFSET ( TVAL32_TV_USEC , compat_timeval , tv_usec ) ;
OFFSET ( TSPC64_TV_SEC , timespec , tv_sec ) ;
OFFSET ( TSPC64_TV_NSEC , timespec , tv_nsec ) ;
OFFSET ( TSPC32_TV_SEC , compat_timespec , tv_sec ) ;
OFFSET ( TSPC32_TV_NSEC , compat_timespec , tv_nsec ) ;
2005-11-11 13:15:21 +03:00
# else
2017-02-15 13:41:20 +03:00
OFFSET ( TVAL32_TV_SEC , timeval , tv_sec ) ;
OFFSET ( TVAL32_TV_USEC , timeval , tv_usec ) ;
OFFSET ( TSPC32_TV_SEC , timespec , tv_sec ) ;
OFFSET ( TSPC32_TV_NSEC , timespec , tv_nsec ) ;
2005-11-11 13:15:21 +03:00
# endif
/* timeval/timezone offsets for use by vdso */
2017-02-15 13:41:20 +03:00
OFFSET ( TZONE_TZ_MINWEST , timezone , tz_minuteswest ) ;
OFFSET ( TZONE_TZ_DSTTIME , timezone , tz_dsttime ) ;
2005-11-11 13:15:21 +03:00
/* Other bits used by the vdso */
DEFINE ( CLOCK_REALTIME , CLOCK_REALTIME ) ;
DEFINE ( CLOCK_MONOTONIC , CLOCK_MONOTONIC ) ;
2017-10-16 08:49:14 +03:00
DEFINE ( CLOCK_REALTIME_COARSE , CLOCK_REALTIME_COARSE ) ;
DEFINE ( CLOCK_MONOTONIC_COARSE , CLOCK_MONOTONIC_COARSE ) ;
2005-11-11 13:15:21 +03:00
DEFINE ( NSEC_PER_SEC , NSEC_PER_SEC ) ;
2008-02-08 01:24:52 +03:00
DEFINE ( CLOCK_REALTIME_RES , MONOTONIC_RES_NSEC ) ;
2005-11-11 13:15:21 +03:00
2007-01-01 21:45:34 +03:00
# ifdef CONFIG_BUG
DEFINE ( BUG_ENTRY_SIZE , sizeof ( struct bug_entry ) ) ;
# endif
2007-08-20 08:58:36 +04:00
2017-04-12 07:56:36 +03:00
# ifdef CONFIG_PPC_BOOK3S_64
DEFINE ( PGD_TABLE_SIZE , ( sizeof ( pgd_t ) << max ( RADIX_PGD_INDEX_SIZE , H_PGD_INDEX_SIZE ) ) ) ;
2016-04-29 16:25:49 +03:00
# else
2007-09-18 11:22:59 +04:00
DEFINE ( PGD_TABLE_SIZE , PGD_TABLE_SIZE ) ;
2016-04-29 16:25:49 +03:00
# endif
2008-09-24 20:01:24 +04:00
DEFINE ( PTE_SIZE , sizeof ( pte_t ) ) ;
2007-12-06 22:11:04 +03:00
2008-04-17 08:28:09 +04:00
# ifdef CONFIG_KVM
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_HOST_STACK , kvm_vcpu , arch . host_stack ) ;
OFFSET ( VCPU_HOST_PID , kvm_vcpu , arch . host_pid ) ;
OFFSET ( VCPU_GUEST_PID , kvm_vcpu , arch . pid ) ;
2018-05-07 09:20:07 +03:00
OFFSET ( VCPU_GPRS , kvm_vcpu , arch . regs . gpr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_VRSAVE , kvm_vcpu , arch . vrsave ) ;
OFFSET ( VCPU_FPRS , kvm_vcpu , arch . fp . fpr ) ;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:21:34 +04:00
# ifdef CONFIG_ALTIVEC
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_VRS , kvm_vcpu , arch . vr . vr ) ;
2011-06-29 04:21:34 +04:00
# endif
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_XER , kvm_vcpu , arch . regs . xer ) ;
OFFSET ( VCPU_CTR , kvm_vcpu , arch . regs . ctr ) ;
OFFSET ( VCPU_LR , kvm_vcpu , arch . regs . link ) ;
2014-04-22 14:26:58 +04:00
# ifdef CONFIG_PPC_BOOK3S
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TAR , kvm_vcpu , arch . tar ) ;
2014-04-22 14:26:58 +04:00
# endif
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_CR , kvm_vcpu , arch . cr ) ;
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_PC , kvm_vcpu , arch . regs . nip ) ;
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_MSR , kvm_vcpu , arch . shregs . msr ) ;
OFFSET ( VCPU_SRR0 , kvm_vcpu , arch . shregs . srr0 ) ;
OFFSET ( VCPU_SRR1 , kvm_vcpu , arch . shregs . srr1 ) ;
OFFSET ( VCPU_SPRG0 , kvm_vcpu , arch . shregs . sprg0 ) ;
OFFSET ( VCPU_SPRG1 , kvm_vcpu , arch . shregs . sprg1 ) ;
OFFSET ( VCPU_SPRG2 , kvm_vcpu , arch . shregs . sprg2 ) ;
OFFSET ( VCPU_SPRG3 , kvm_vcpu , arch . shregs . sprg3 ) ;
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spent in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
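A reader of the "timings" file could be sketched as follows. The message only says that each line is a name, a colon and four numbers; this sketch additionally assumes the numbers are whitespace-separated decimals. The struct timing_line type and the parse_timing_line() helper are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of the "timings" file (layout partly assumed, see above). */
struct timing_line {
	char name[32];
	unsigned long long count;    /* number of times the code was executed */
	unsigned long long total_ns; /* total time */
	unsigned long long min_ns;   /* minimum time */
	unsigned long long max_ns;   /* maximum time */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int parse_timing_line(const char *line, struct timing_line *out)
{
	assert(line != NULL && out != NULL);

	/* %31[^:] reads the name up to the colon, bounded by the buffer size. */
	if (sscanf(line, " %31[^:]: %llu %llu %llu %llu",
		   out->name, &out->count, &out->total_ns,
		   &out->min_ns, &out->max_ns) != 5)
		return -1;
	return 0;
}
```

Dividing total_ns by count then gives the mean time per execution for that activity, which is often the first figure of interest when comparing runs.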
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 06:21:02 +03:00
# endif
# ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TB_RMENTRY , kvm_vcpu , arch . rm_entry ) ;
OFFSET ( VCPU_TB_RMINTR , kvm_vcpu , arch . rm_intr ) ;
OFFSET ( VCPU_TB_RMEXIT , kvm_vcpu , arch . rm_exit ) ;
OFFSET ( VCPU_TB_GUEST , kvm_vcpu , arch . guest_time ) ;
OFFSET ( VCPU_TB_CEDE , kvm_vcpu , arch . cede_time ) ;
OFFSET ( VCPU_CUR_ACTIVITY , kvm_vcpu , arch . cur_activity ) ;
OFFSET ( VCPU_ACTIVITY_START , kvm_vcpu , arch . cur_tb_start ) ;
OFFSET ( TAS_SEQCOUNT , kvmhv_tb_accumulator , seqcount ) ;
OFFSET ( TAS_TOTAL , kvmhv_tb_accumulator , tb_total ) ;
OFFSET ( TAS_MIN , kvmhv_tb_accumulator , tb_min ) ;
OFFSET ( TAS_MAX , kvmhv_tb_accumulator , tb_max ) ;
# endif
OFFSET ( VCPU_SHARED_SPRG3 , kvm_vcpu_arch_shared , sprg3 ) ;
OFFSET ( VCPU_SHARED_SPRG4 , kvm_vcpu_arch_shared , sprg4 ) ;
OFFSET ( VCPU_SHARED_SPRG5 , kvm_vcpu_arch_shared , sprg5 ) ;
OFFSET ( VCPU_SHARED_SPRG6 , kvm_vcpu_arch_shared , sprg6 ) ;
OFFSET ( VCPU_SHARED_SPRG7 , kvm_vcpu_arch_shared , sprg7 ) ;
OFFSET ( VCPU_SHADOW_PID , kvm_vcpu , arch . shadow_pid ) ;
OFFSET ( VCPU_SHADOW_PID1 , kvm_vcpu , arch . shadow_pid1 ) ;
OFFSET ( VCPU_SHARED , kvm_vcpu , arch . shared ) ;
OFFSET ( VCPU_SHARED_MSR , kvm_vcpu_arch_shared , msr ) ;
OFFSET ( VCPU_SHADOW_MSR , kvm_vcpu , arch . shadow_msr ) ;
2014-04-24 15:46:24 +04:00
# if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SHAREDBE , kvm_vcpu , arch . shared_big_endian ) ;
2014-04-24 15:46:24 +04:00
# endif
2008-04-17 08:28:09 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SHARED_MAS0 , kvm_vcpu_arch_shared , mas0 ) ;
OFFSET ( VCPU_SHARED_MAS1 , kvm_vcpu_arch_shared , mas1 ) ;
OFFSET ( VCPU_SHARED_MAS2 , kvm_vcpu_arch_shared , mas2 ) ;
OFFSET ( VCPU_SHARED_MAS7_3 , kvm_vcpu_arch_shared , mas7_3 ) ;
OFFSET ( VCPU_SHARED_MAS4 , kvm_vcpu_arch_shared , mas4 ) ;
OFFSET ( VCPU_SHARED_MAS6 , kvm_vcpu_arch_shared , mas6 ) ;
KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn
This allows additional registers to be accessed by the guest
in PR-mode KVM without trapping.
SPRG4-7 are readable from userspace. On booke, KVM will sync
these registers when it enters the guest, so that accesses from
guest userspace will work. The guest kernel, OTOH, must consistently
use either the real registers or the shared area between exits. This
also applies to the already-paravirted SPRG3.
On non-booke, it's not clear to what extent SPRG4-7 are supported
(they're not architected for book3s, but exist on at least some classic
chips). They are copied in the get/set regs ioctls, but I do not see any
non-booke emulation. I also do not see any syncing with real registers
(in PR-mode) including the user-readable SPRG3. This patch should not
make that situation any worse.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-09 04:23:30 +04:00
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_KVM , kvm_vcpu , kvm ) ;
OFFSET ( KVM_LPID , kvm , arch . lpid ) ;
2011-12-20 19:34:43 +04:00
2010-04-16 02:11:42 +04:00
/* book3s */
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_TLB_SETS , kvm , arch . tlb_sets ) ;
OFFSET ( KVM_SDR1 , kvm , arch . sdr1 ) ;
OFFSET ( KVM_HOST_LPID , kvm , arch . host_lpid ) ;
OFFSET ( KVM_HOST_LPCR , kvm , arch . host_lpcr ) ;
OFFSET ( KVM_HOST_SDR1 , kvm , arch . host_sdr1 ) ;
OFFSET ( KVM_NEED_FLUSH , kvm , arch . need_tlb_flush . bits ) ;
OFFSET ( KVM_ENABLED_HCALLS , kvm , arch . enabled_hcalls ) ;
OFFSET ( KVM_VRMA_SLB_V , kvm , arch . vrma_slb_v ) ;
OFFSET ( KVM_RADIX , kvm , arch . radix ) ;
2017-05-11 14:02:48 +03:00
OFFSET ( KVM_FWNMI , kvm , arch . fwnmi_enabled ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_DSISR , kvm_vcpu , arch . shregs . dsisr ) ;
OFFSET ( VCPU_DAR , kvm_vcpu , arch . shregs . dar ) ;
OFFSET ( VCPU_VPA , kvm_vcpu , arch . vpa . pinned_addr ) ;
OFFSET ( VCPU_VPA_DIRTY , kvm_vcpu , arch . vpa . dirty ) ;
OFFSET ( VCPU_HEIR , kvm_vcpu , arch . emul_inst ) ;
OFFSET ( VCPU_CPU , kvm_vcpu , cpu ) ;
OFFSET ( VCPU_THREAD_CPU , kvm_vcpu , arch . thread_cpu ) ;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:21:34 +04:00
# endif
2010-04-16 02:11:42 +04:00
# ifdef CONFIG_PPC_BOOK3S
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_PURR , kvm_vcpu , arch . purr ) ;
OFFSET ( VCPU_SPURR , kvm_vcpu , arch . spurr ) ;
OFFSET ( VCPU_IC , kvm_vcpu , arch . ic ) ;
OFFSET ( VCPU_DSCR , kvm_vcpu , arch . dscr ) ;
OFFSET ( VCPU_AMR , kvm_vcpu , arch . amr ) ;
OFFSET ( VCPU_UAMOR , kvm_vcpu , arch . uamor ) ;
OFFSET ( VCPU_IAMR , kvm_vcpu , arch . iamr ) ;
OFFSET ( VCPU_CTRL , kvm_vcpu , arch . ctrl ) ;
OFFSET ( VCPU_DABR , kvm_vcpu , arch . dabr ) ;
OFFSET ( VCPU_DABRX , kvm_vcpu , arch . dabrx ) ;
OFFSET ( VCPU_DAWR , kvm_vcpu , arch . dawr ) ;
OFFSET ( VCPU_DAWRX , kvm_vcpu , arch . dawrx ) ;
OFFSET ( VCPU_CIABR , kvm_vcpu , arch . ciabr ) ;
OFFSET ( VCPU_HFLAGS , kvm_vcpu , arch . hflags ) ;
OFFSET ( VCPU_DEC , kvm_vcpu , arch . dec ) ;
OFFSET ( VCPU_DEC_EXPIRES , kvm_vcpu , arch . dec_expires ) ;
OFFSET ( VCPU_PENDING_EXC , kvm_vcpu , arch . pending_exceptions ) ;
OFFSET ( VCPU_CEDED , kvm_vcpu , arch . ceded ) ;
OFFSET ( VCPU_PRODDED , kvm_vcpu , arch . prodded ) ;
2018-01-12 05:37:13 +03:00
OFFSET ( VCPU_IRQ_PENDING , kvm_vcpu , arch . irq_pending ) ;
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-16 09:41:20 +03:00
OFFSET ( VCPU_DBELL_REQ , kvm_vcpu , arch . doorbell_request ) ;
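The doorbell_request handshake described in the commit message above can be sketched roughly as follows. This is a simplified model, not the kernel's code: `demo_vcpu`, `demo_hw`, `demo_request_doorbell` and `demo_doorbell_entry` are hypothetical names, and the real guest entry path is in assembly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the state named in the commit message. */
struct demo_vcpu {
	bool doorbell_request;	/* set by other VCPUs that want to ring this one */
};

/* Stand-in for the hardware DPDES register of the thread running the guest. */
struct demo_hw {
	unsigned long dpdes;
};

/* Another VCPU asks for a doorbell: it only sets the flag, never DPDES itself. */
static void demo_request_doorbell(struct demo_vcpu *target)
{
	target->doorbell_request = true;
}

/* Guest entry consumes the flag, resets it to 0, and then sets DPDES to 1. */
static void demo_doorbell_entry(struct demo_vcpu *vcpu, struct demo_hw *hw)
{
	assert(vcpu && hw);
	if (vcpu->doorbell_request) {
		vcpu->doorbell_request = false;
		hw->dpdes = 1;
	}
}
```

Keeping the request in a separate flag means only the entry code for the target VCPU ever writes DPDES, which matches the message's point that the saved dpdes field is owned by the guest exit code.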
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_MMCR , kvm_vcpu , arch . mmcr ) ;
OFFSET ( VCPU_PMC , kvm_vcpu , arch . pmc ) ;
OFFSET ( VCPU_SPMC , kvm_vcpu , arch . spmc ) ;
OFFSET ( VCPU_SIAR , kvm_vcpu , arch . siar ) ;
OFFSET ( VCPU_SDAR , kvm_vcpu , arch . sdar ) ;
OFFSET ( VCPU_SIER , kvm_vcpu , arch . sier ) ;
OFFSET ( VCPU_SLB , kvm_vcpu , arch . slb ) ;
OFFSET ( VCPU_SLB_MAX , kvm_vcpu , arch . slb_max ) ;
OFFSET ( VCPU_SLB_NR , kvm_vcpu , arch . slb_nr ) ;
OFFSET ( VCPU_FAULT_DSISR , kvm_vcpu , arch . fault_dsisr ) ;
OFFSET ( VCPU_FAULT_DAR , kvm_vcpu , arch . fault_dar ) ;
OFFSET ( VCPU_FAULT_GPA , kvm_vcpu , arch . fault_gpa ) ;
OFFSET ( VCPU_INTR_MSR , kvm_vcpu , arch . intr_msr ) ;
OFFSET ( VCPU_LAST_INST , kvm_vcpu , arch . last_inst ) ;
OFFSET ( VCPU_TRAP , kvm_vcpu , arch . trap ) ;
OFFSET ( VCPU_CFAR , kvm_vcpu , arch . cfar ) ;
OFFSET ( VCPU_PPR , kvm_vcpu , arch . ppr ) ;
OFFSET ( VCPU_FSCR , kvm_vcpu , arch . fscr ) ;
OFFSET ( VCPU_PSPB , kvm_vcpu , arch . pspb ) ;
OFFSET ( VCPU_EBBHR , kvm_vcpu , arch . ebbhr ) ;
OFFSET ( VCPU_EBBRR , kvm_vcpu , arch . ebbrr ) ;
OFFSET ( VCPU_BESCR , kvm_vcpu , arch . bescr ) ;
OFFSET ( VCPU_CSIGR , kvm_vcpu , arch . csigr ) ;
OFFSET ( VCPU_TACR , kvm_vcpu , arch . tacr ) ;
OFFSET ( VCPU_TCSCR , kvm_vcpu , arch . tcscr ) ;
OFFSET ( VCPU_ACOP , kvm_vcpu , arch . acop ) ;
OFFSET ( VCPU_WORT , kvm_vcpu , arch . wort ) ;
OFFSET ( VCPU_TID , kvm_vcpu , arch . tid ) ;
OFFSET ( VCPU_PSSCR , kvm_vcpu , arch . psscr ) ;
2017-02-15 06:30:17 +03:00
OFFSET ( VCPU_HFSCR , kvm_vcpu , arch . hfscr ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCORE_ENTRY_EXIT , kvmppc_vcore , entry_exit_map ) ;
OFFSET ( VCORE_IN_GUEST , kvmppc_vcore , in_guest ) ;
OFFSET ( VCORE_NAPPING_THREADS , kvmppc_vcore , napping_threads ) ;
OFFSET ( VCORE_KVM , kvmppc_vcore , kvm ) ;
OFFSET ( VCORE_TB_OFFSET , kvmppc_vcore , tb_offset ) ;
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit. Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore. If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.
To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit. Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.
In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads. Although no specific incorrect behaviour has
been observed, this is a race which should be fixed. To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset. That way we are sure that the timebase contains the guest
timebase value in both cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-04-20 15:51:11 +03:00
OFFSET ( VCORE_TB_OFFSET_APPL , kvmppc_vcore , tb_offset_applied ) ;
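A minimal sketch of the tb_offset_applied idea from the commit message above, assuming a plain integer stands in for the hardware timebase. The struct and function names (`demo_vcore`, `demo_tb_enter`, `demo_tb_exit`) are hypothetical; only the two field names come from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the vcore struct; only the two fields that matter. */
struct demo_vcore {
	uint64_t tb_offset;		/* userspace may change this while running */
	uint64_t tb_offset_applied;	/* offset currently added; 0 when none */
};

/* Guest entry: snapshot the offset, then apply the snapshot to the timebase. */
static void demo_tb_enter(struct demo_vcore *vc, uint64_t *timebase)
{
	assert(vc->tb_offset_applied == 0);
	vc->tb_offset_applied = vc->tb_offset;
	*timebase += vc->tb_offset_applied;
}

/* Guest exit: subtract what was actually applied, not the current tb_offset. */
static void demo_tb_exit(struct demo_vcore *vc, uint64_t *timebase)
{
	*timebase -= vc->tb_offset_applied;
	vc->tb_offset_applied = 0;
}
```

Because exit uses the snapshot, a change to tb_offset between entry and exit cannot leave the host timebase skewed.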
2017-02-15 13:41:20 +03:00
OFFSET ( VCORE_LPCR , kvmppc_vcore , lpcr ) ;
OFFSET ( VCORE_PCR , kvmppc_vcore , pcr ) ;
OFFSET ( VCORE_DPDES , kvmppc_vcore , dpdes ) ;
OFFSET ( VCORE_VTB , kvmppc_vcore , vtb ) ;
OFFSET ( VCPU_SLB_E , kvmppc_slb , orige ) ;
OFFSET ( VCPU_SLB_V , kvmppc_slb , origv ) ;
2011-06-29 04:21:34 +04:00
DEFINE ( VCPU_SLB_SIZE , sizeof ( struct kvmppc_slb ) ) ;
2014-01-08 14:25:32 +04:00
# ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TFHAR , kvm_vcpu , arch . tfhar ) ;
OFFSET ( VCPU_TFIAR , kvm_vcpu , arch . tfiar ) ;
OFFSET ( VCPU_TEXASR , kvm_vcpu , arch . texasr ) ;
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 13:32:01 +03:00
OFFSET ( VCPU_ORIG_TEXASR , kvm_vcpu , arch . orig_texasr ) ;
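The guest entry decision described in the commit message above (fake suspend versus rollback on POWER9 DD2.2) can be illustrated with a small decision function. This is only a sketch of the stated rule: the enums and `demo_tm_entry_action` are hypothetical, and the real entry path also manipulates the PACA flag and the PSSCR bit, which are not modelled here.

```c
#include <assert.h>

/* Hypothetical encoding of the guest's transactional-memory state. */
enum demo_tm_state {
	DEMO_TM_NONE,
	DEMO_TM_SUSPENDED,
	DEMO_TM_TRANSACTIONAL
};

/* What the entry path does instead of the usual trechkpt. */
enum demo_entry_action {
	DEMO_ENTER_PLAIN,		/* no checkpoint would be needed */
	DEMO_ENTER_FAKE_SUSPEND,	/* looks suspended, but no checkpoint stored */
	DEMO_ENTER_ROLLBACK		/* return to the pre-transactional state */
};

static enum demo_entry_action demo_tm_entry_action(enum demo_tm_state s)
{
	switch (s) {
	case DEMO_TM_SUSPENDED:
		return DEMO_ENTER_FAKE_SUSPEND;
	case DEMO_TM_TRANSACTIONAL:
		return DEMO_ENTER_ROLLBACK;
	default:
		assert(s == DEMO_TM_NONE);
		return DEMO_ENTER_PLAIN;
	}
}
```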
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_GPR_TM , kvm_vcpu , arch . gpr_tm ) ;
OFFSET ( VCPU_FPRS_TM , kvm_vcpu , arch . fp_tm . fpr ) ;
OFFSET ( VCPU_VRS_TM , kvm_vcpu , arch . vr_tm . vr ) ;
OFFSET ( VCPU_VRSAVE_TM , kvm_vcpu , arch . vrsave_tm ) ;
OFFSET ( VCPU_CR_TM , kvm_vcpu , arch . cr_tm ) ;
OFFSET ( VCPU_XER_TM , kvm_vcpu , arch . xer_tm ) ;
OFFSET ( VCPU_LR_TM , kvm_vcpu , arch . lr_tm ) ;
OFFSET ( VCPU_CTR_TM , kvm_vcpu , arch . ctr_tm ) ;
OFFSET ( VCPU_AMR_TM , kvm_vcpu , arch . amr_tm ) ;
OFFSET ( VCPU_PPR_TM , kvm_vcpu , arch . ppr_tm ) ;
OFFSET ( VCPU_DSCR_TM , kvm_vcpu , arch . dscr_tm ) ;
OFFSET ( VCPU_TAR_TM , kvm_vcpu , arch . tar_tm ) ;
2014-01-08 14:25:32 +04:00
# endif
2011-06-29 04:20:58 +04:00
# ifdef CONFIG_PPC_BOOK3S_64
2013-10-07 20:47:51 +04:00
# ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_SVCPU , paca_struct , shadow_vcpu ) ;
2011-06-29 04:20:58 +04:00
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, shadow_vcpu.f))
2011-06-29 04:21:34 +04:00
# else
# define SVCPU_FIELD(x, f)
# endif
2011-06-29 04:20:58 +04:00
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, kvm_hstate.f))
# else /* 32-bit */
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, f))
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, hstate.f))
# endif
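As a rough illustration of what the SVCPU_FIELD and HSTATE_FIELD wrappers above compute, the sketch below pairs a symbol name with the offsetof() of a nested member. It is an assumption-laden stand-in: the kernel's DEFINE emits a marker into the assembler output, whereas the hypothetical `emit_define` helper here simply formats a `#define` line into a buffer, and `demo_paca` and `demo_hstate` are invented layouts, not `paca_struct`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; layouts are invented. */
struct demo_hstate {
	unsigned long host_r1;
	unsigned long host_r2;
};

struct demo_paca {
	unsigned long scratch;
	struct demo_hstate kvm_hstate;
};

/* Format "#define NAME value" into buf and return buf for convenience. */
static char *emit_define(char *buf, size_t len, const char *name, size_t val)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "#define %s %zu", name, val);
	return buf;
}

/* Same shape as HSTATE_FIELD: a name plus offsetof() of a nested member.
 * buf must be a char array so that sizeof(buf) gives its capacity. */
#define DEMO_HSTATE_FIELD(buf, x, f) \
	emit_define(buf, sizeof(buf), #x, \
		    offsetof(struct demo_paca, kvm_hstate.f))
```

Assembly code can then refer to the generated name instead of hard-coding a byte offset, so the offset follows the C struct layout whenever it changes.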
SVCPU_FIELD ( SVCPU_CR , cr ) ;
SVCPU_FIELD ( SVCPU_XER , xer ) ;
SVCPU_FIELD ( SVCPU_CTR , ctr ) ;
SVCPU_FIELD ( SVCPU_LR , lr ) ;
SVCPU_FIELD ( SVCPU_PC , pc ) ;
SVCPU_FIELD ( SVCPU_R0 , gpr [ 0 ] ) ;
SVCPU_FIELD ( SVCPU_R1 , gpr [ 1 ] ) ;
SVCPU_FIELD ( SVCPU_R2 , gpr [ 2 ] ) ;
SVCPU_FIELD ( SVCPU_R3 , gpr [ 3 ] ) ;
SVCPU_FIELD ( SVCPU_R4 , gpr [ 4 ] ) ;
SVCPU_FIELD ( SVCPU_R5 , gpr [ 5 ] ) ;
SVCPU_FIELD ( SVCPU_R6 , gpr [ 6 ] ) ;
SVCPU_FIELD ( SVCPU_R7 , gpr [ 7 ] ) ;
SVCPU_FIELD ( SVCPU_R8 , gpr [ 8 ] ) ;
SVCPU_FIELD ( SVCPU_R9 , gpr [ 9 ] ) ;
SVCPU_FIELD ( SVCPU_R10 , gpr [ 10 ] ) ;
SVCPU_FIELD ( SVCPU_R11 , gpr [ 11 ] ) ;
SVCPU_FIELD ( SVCPU_R12 , gpr [ 12 ] ) ;
SVCPU_FIELD ( SVCPU_R13 , gpr [ 13 ] ) ;
SVCPU_FIELD ( SVCPU_FAULT_DSISR , fault_dsisr ) ;
SVCPU_FIELD ( SVCPU_FAULT_DAR , fault_dar ) ;
SVCPU_FIELD ( SVCPU_LAST_INST , last_inst ) ;
SVCPU_FIELD ( SVCPU_SHADOW_SRR1 , shadow_srr1 ) ;
2010-04-16 02:11:44 +04:00
# ifdef CONFIG_PPC_BOOK3S_32
2011-06-29 04:20:58 +04:00
SVCPU_FIELD ( SVCPU_SR , sr ) ;
2010-04-16 02:11:44 +04:00
# endif
2011-06-29 04:20:58 +04:00
# ifdef CONFIG_PPC64
SVCPU_FIELD ( SVCPU_SLB , slb ) ;
SVCPU_FIELD ( SVCPU_SLB_MAX , slb_max ) ;
2014-04-29 18:48:44 +04:00
SVCPU_FIELD ( SVCPU_SHADOW_FSCR , shadow_fscr ) ;
2011-06-29 04:20:58 +04:00
# endif
HSTATE_FIELD ( HSTATE_HOST_R1 , host_r1 ) ;
HSTATE_FIELD ( HSTATE_HOST_R2 , host_r2 ) ;
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_HOST_MSR , host_msr ) ;
2011-06-29 04:20:58 +04:00
HSTATE_FIELD ( HSTATE_VMHANDLER , vmhandler ) ;
HSTATE_FIELD ( HSTATE_SCRATCH0 , scratch0 ) ;
HSTATE_FIELD ( HSTATE_SCRATCH1 , scratch1 ) ;
2013-11-11 17:59:47 +04:00
HSTATE_FIELD ( HSTATE_SCRATCH2 , scratch2 ) ;
2011-06-29 04:20:58 +04:00
HSTATE_FIELD ( HSTATE_IN_GUEST , in_guest ) ;
2011-07-23 11:41:44 +04:00
HSTATE_FIELD ( HSTATE_RESTORE_HID5 , restore_hid5 ) ;
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 11:42:46 +04:00
HSTATE_FIELD ( HSTATE_NAPPING , napping ) ;
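The H_CEDE rule from the commit message above (exit to the kernel only when every other vcpu in the core is idle, otherwise nap) can be sketched as below. All names are hypothetical, the per-thread `ceded` array is an assumed simplification, and the real handler is real-mode assembly that also saves GPRs and FPRs before executing the nap instruction.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_THREADS_PER_CORE 4

/* Hypothetical per-core bookkeeping for the vcpus sharing one core. */
struct demo_core {
	bool ceded[DEMO_THREADS_PER_CORE];	/* vcpu has done H_CEDE and is idle */
	bool napping[DEMO_THREADS_PER_CORE];	/* thread was put into nap mode */
};

/* Returns true if the caller must exit to the kernel, false if it naps. */
static bool demo_h_cede(struct demo_core *core, int thread)
{
	bool others_idle = true;
	int i;

	assert(thread >= 0 && thread < DEMO_THREADS_PER_CORE);
	core->ceded[thread] = true;

	for (i = 0; i < DEMO_THREADS_PER_CORE; i++) {
		if (i != thread && !core->ceded[i])
			others_idle = false;
	}

	if (others_idle)
		return true;		/* whole core idle: hand back to the kernel */

	core->napping[thread] = true;	/* stay in the guest context and nap */
	return false;
}
```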
2011-06-29 04:20:58 +04:00
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2012-03-06 01:42:25 +04:00
HSTATE_FIELD ( HSTATE_HWTHREAD_REQ , hwthread_req ) ;
HSTATE_FIELD ( HSTATE_HWTHREAD_STATE , hwthread_state ) ;
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_KVM_VCPU , kvm_vcpu ) ;
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
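As a rough illustration of the numbering rule above, the mapping can be sketched as follows. The helper names are hypothetical and are not taken from the KVM sources; threads_per_core stands for the value that KVM_CAP_PPC_SMT reports.

```c
#include <assert.h>

/* Hypothetical helper: the vcore a vcpu lands in under the rule above. */
static int vcore_of_vcpu(int vcpu_id, int threads_per_core)
{
	assert(vcpu_id >= 0 && threads_per_core > 0);
	return vcpu_id / threads_per_core;
}

/*
 * Hypothetical helper: the vcpu number userspace would pick for its n-th
 * vcpu so that every vcpu sits alone in its own vcore (single-threaded
 * guest). The result is always a multiple of threads_per_core.
 */
static int single_threaded_vcpu_id(int n, int threads_per_core)
{
	assert(n >= 0 && threads_per_core > 0);
	return n * threads_per_core;
}
```

With 4 threads per core, vcpus 0..3 share vcore 0 and vcpus 4..7 share vcore 1, while a single-threaded guest would use vcpu numbers 0, 4, 8, and so on.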
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
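The run policy described above amounts to a simple predicate over the states of the vcpus in a vcore. The following is only a sketch; the enum and function names are made up for illustration and do not match the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three vcpu states described above. */
enum vcpu_state {
	VCPU_RUNNABLE,		/* ready to execute the guest */
	VCPU_BLOCKED,		/* idle */
	VCPU_BUSY_IN_HOST,	/* executing elsewhere in the kernel or in qemu */
};

/* The vcore may run only if no vcpu in it is busy in the host. */
static bool vcore_can_run(const enum vcpu_state *vcpus, size_t n)
{
	assert(vcpus != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (vcpus[i] == VCPU_BUSY_IN_HOST)
			return false;
	}
	return true;
}
```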
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:23:08 +04:00
HSTATE_FIELD ( HSTATE_KVM_VCORE , kvm_vcore ) ;
HSTATE_FIELD ( HSTATE_XICS_PHYS , xics_phys ) ;
2017-04-05 10:54:56 +03:00
HSTATE_FIELD ( HSTATE_XIVE_TIMA_PHYS , xive_tima_phys ) ;
HSTATE_FIELD ( HSTATE_XIVE_TIMA_VIRT , xive_tima_virt ) ;
2013-04-18 00:30:50 +04:00
HSTATE_FIELD ( HSTATE_SAVED_XIRR , saved_xirr ) ;
HSTATE_FIELD ( HSTATE_HOST_IPI , host_ipi ) ;
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core. Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers. Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).
POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core. Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.
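A minimal sketch of the alignment rule, assuming that the vcpu numbers within a virtual core are consecutive and start at a multiple of the number of threads per core. The function name is hypothetical and the real code may derive the thread index differently.

```c
#include <assert.h>

/*
 * Hypothetical helper: physical thread ID for a vcpu when virtual thread N
 * of a virtual core must run on physical thread N.
 */
static int aligned_ptid(int vcpu_id, int threads_per_core)
{
	assert(vcpu_id >= 0 && threads_per_core > 0);
	return vcpu_id % threads_per_core;
}
```

Under this assumption, vcpu 6 with 4 threads per core is virtual thread 2 of vcore 1 and therefore always runs on physical thread 2, whether or not vcpus 4 and 5 are runnable.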
This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace. However, we do need thread 0
to be responsible for switching the MMU -- a previous version of
this patch that had other threads switching the MMU was found to
be responsible for occasional memory corruption and machine check
interrupts in the guest on POWER7 machines.
To accommodate this, we no longer pass the vcpu pointer to
__kvmppc_vcore_entry, but instead let the assembly code load it from
the PACA. Since the assembly code will need to know the kvm pointer
and the thread ID for threads which don't have a vcpu, we move the
thread ID into the PACA and we add a kvm pointer to the virtual core
structure.
In the case where thread 0 has no vcpu to run, it still calls into
kvmppc_hv_entry in order to do the MMU switch, and then naps until
either its vcpu is ready to run in the guest, or some other thread
needs to exit the guest. In the latter case, thread 0 jumps to the
code that switches the MMU back to the host. This control flow means
that now we switch the MMU before loading any guest vcpu state.
Similarly, on guest exit we now save all the guest vcpu state before
switching the MMU back to the host. This has required substantial
code movement, making the diff rather large.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-08 14:25:20 +04:00
HSTATE_FIELD ( HSTATE_PTID , ptid ) ;
KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts
This patch removes the restriction that a radix host can only run
radix guests, allowing us to run HPT (hashed page table) guests as
well. This is useful because it provides a way to run old guest
kernels that know about POWER8 but not POWER9.
Unfortunately, POWER9 currently has a restriction that all threads
in a given core must either all be in HPT mode, or all in radix mode.
This means that when entering a HPT guest, we have to obtain control
of all 4 threads in the core and get them to switch their LPIDR and
LPCR registers, even if they are not going to run a guest. On guest
exit we also have to get all threads to switch LPIDR and LPCR back
to host values.
To make this feasible, we require that KVM not be in the "independent
threads" mode, and that the CPU cores be in single-threaded mode from
the host kernel's perspective (only thread 0 online; threads 1, 2 and
3 offline). That allows us to use the same code as on POWER8 for
obtaining control of the secondary threads.
To manage the LPCR/LPIDR changes required, we extend the kvm_split_mode
struct to contain the information needed by the secondary threads.
All threads perform a barrier synchronization (where all threads wait
for every other thread to reach the synchronization point) on guest
entry, both before and after loading LPCR and LPIDR. On guest exit,
they all once again perform a barrier synchronization both before
and after loading host values into LPCR and LPIDR.
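For readers unfamiliar with the term, a one-shot counting barrier can be sketched in portable C11 as below. This is an illustration of the general technique only; the type and function names are hypothetical, and the real-mode code in the patch is written differently. One such synchronization point would be needed for each of the "before" and "after" phases.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-shot synchronization point for nthreads threads. */
struct sync_point {
	atomic_int arrived;	/* number of threads that have arrived */
	int nthreads;		/* number of threads expected */
};

static void sync_init(struct sync_point *s, int nthreads)
{
	assert(nthreads > 0);
	atomic_init(&s->arrived, 0);
	s->nthreads = nthreads;
}

/* Record that the calling thread has reached the synchronization point. */
static void sync_arrive(struct sync_point *s)
{
	atomic_fetch_add(&s->arrived, 1);
}

/* True once every expected thread has arrived. */
static bool sync_passed(struct sync_point *s)
{
	return atomic_load(&s->arrived) >= s->nthreads;
}

/* Arrive, then spin until all other threads have arrived too. */
static void sync_wait(struct sync_point *s)
{
	sync_arrive(s);
	while (!sync_passed(s))
		;	/* busy-wait */
}
```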
Finally, it is also currently necessary to flush the entire TLB every
time we enter a HPT guest on a radix host. We do this on thread 0
with a loop of tlbiel instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-19 06:11:23 +03:00
HSTATE_FIELD ( HSTATE_TID , tid ) ;
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
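The choice made on guest entry can be summarized by the following sketch. The enums and the function are hypothetical names introduced for illustration; they do not appear in the patch, which makes this decision in the guest entry path.

```c
#include <assert.h>

/* Hypothetical encoding of the guest's transactional memory state. */
enum tm_state { TM_NONE, TM_SUSPENDED, TM_TRANSACTIONAL };

/* Hypothetical encoding of what the entry path does instead of trechkpt. */
enum entry_action {
	ENTRY_NORMAL,		/* no checkpoint would be needed */
	ENTRY_FAKE_SUSPEND,	/* guest was suspended: enter fake suspend */
	ENTRY_ROLLBACK,		/* guest was transactional: roll back */
};

static enum entry_action dd22_entry_action(enum tm_state guest)
{
	switch (guest) {
	case TM_SUSPENDED:
		return ENTRY_FAKE_SUSPEND;
	case TM_TRANSACTIONAL:
		return ENTRY_ROLLBACK;
	case TM_NONE:
		return ENTRY_NORMAL;
	}
	assert(!"unknown TM state");
	return ENTRY_NORMAL;
}
```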
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a soft patch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 13:32:01 +03:00
HSTATE_FIELD ( HSTATE_FAKE_SUSPEND , fake_suspend ) ;
2014-07-10 13:34:31 +04:00
HSTATE_FIELD ( HSTATE_MMCR0 , host_mmcr [ 0 ] ) ;
HSTATE_FIELD ( HSTATE_MMCR1 , host_mmcr [ 1 ] ) ;
HSTATE_FIELD ( HSTATE_MMCRA , host_mmcr [ 2 ] ) ;
HSTATE_FIELD ( HSTATE_SIAR , host_mmcr [ 3 ] ) ;
HSTATE_FIELD ( HSTATE_SDAR , host_mmcr [ 4 ] ) ;
HSTATE_FIELD ( HSTATE_MMCR2 , host_mmcr [ 5 ] ) ;
HSTATE_FIELD ( HSTATE_SIER , host_mmcr [ 6 ] ) ;
HSTATE_FIELD ( HSTATE_PMC1 , host_pmc [ 0 ] ) ;
HSTATE_FIELD ( HSTATE_PMC2 , host_pmc [ 1 ] ) ;
HSTATE_FIELD ( HSTATE_PMC3 , host_pmc [ 2 ] ) ;
HSTATE_FIELD ( HSTATE_PMC4 , host_pmc [ 3 ] ) ;
HSTATE_FIELD ( HSTATE_PMC5 , host_pmc [ 4 ] ) ;
HSTATE_FIELD ( HSTATE_PMC6 , host_pmc [ 5 ] ) ;
2011-06-29 04:21:34 +04:00
HSTATE_FIELD ( HSTATE_PURR , host_purr ) ;
HSTATE_FIELD ( HSTATE_SPURR , host_spurr ) ;
HSTATE_FIELD ( HSTATE_DSCR , host_dscr ) ;
HSTATE_FIELD ( HSTATE_DABR , dabr ) ;
HSTATE_FIELD ( HSTATE_DECEXP , dec_expires ) ;
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
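The bitmap interpretation of dynamic_mt_modes described above can be sketched as a single bit test. The helper name is hypothetical; only the parameter name and the values 0, 2, 4 and 6 come from the description.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: is n_way (2 or 4) micro-threading enabled by the
 * dynamic_mt_modes bitmap? Bit value 2 selects 2-way, bit value 4 selects
 * 4-way.
 */
static bool mt_mode_considered(unsigned int dynamic_mt_modes,
			       unsigned int n_way)
{
	assert(n_way == 2 || n_way == 4);
	return (dynamic_mt_modes & n_way) != 0;
}
```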
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-07-02 13:38:16 +03:00
HSTATE_FIELD ( HSTATE_SPLIT_MODE , kvm_split_mode ) ;
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
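The decision taken when a vcpu does H_CEDE can be sketched in C as follows. The actual handler is assembly in book3s_hv_rmhandlers.S; the names below are hypothetical and the idle[] array stands in for whatever per-vcore bookkeeping the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of an H_CEDE hcall handled in real mode. */
enum cede_action {
	CEDE_NAP,		/* mark vcpu napping, save state, execute nap */
	CEDE_EXIT_TO_KERNEL,	/* every other vcpu in the core is idle too */
};

/*
 * idle[i] says whether vcpu i of the core is already idle; self is the
 * index of the vcpu doing H_CEDE.
 */
static enum cede_action cede_decide(const bool *idle, size_t n, size_t self)
{
	assert(idle != NULL && self < n);
	for (size_t i = 0; i < n; i++) {
		if (i != self && !idle[i])
			return CEDE_NAP;	/* someone else is still running */
	}
	return CEDE_EXIT_TO_KERNEL;
}
```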
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 11:42:46 +04:00
DEFINE ( IPI_PRIORITY , IPI_PRIORITY ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_SPLIT_RPR , kvm_split_mode , rpr ) ;
OFFSET ( KVM_SPLIT_PMMAR , kvm_split_mode , pmmar ) ;
OFFSET ( KVM_SPLIT_LDBAR , kvm_split_mode , ldbar ) ;
OFFSET ( KVM_SPLIT_DO_NAP , kvm_split_mode , do_nap ) ;
OFFSET ( KVM_SPLIT_NAPPED , kvm_split_mode , napped ) ;
2017-10-19 06:11:23 +03:00
OFFSET ( KVM_SPLIT_DO_SET , kvm_split_mode , do_set ) ;
OFFSET ( KVM_SPLIT_DO_RESTORE , kvm_split_mode , do_restore ) ;
2013-10-07 20:47:52 +04:00
# endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
2011-06-29 04:21:34 +04:00
2013-02-04 22:10:51 +04:00
# ifdef CONFIG_PPC_BOOK3S_64
HSTATE_FIELD ( HSTATE_CFAR , cfar ) ;
2013-09-20 08:52:39 +04:00
HSTATE_FIELD ( HSTATE_PPR , ppr ) ;
2014-04-29 18:48:44 +04:00
HSTATE_FIELD ( HSTATE_HOST_FSCR , host_fscr ) ;
2013-02-04 22:10:51 +04:00
# endif /* CONFIG_PPC_BOOK3S_64 */
2011-06-29 04:20:58 +04:00
# else /* CONFIG_PPC_BOOK3S */
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_CR , kvm_vcpu , arch . cr ) ;
2018-05-07 09:20:08 +03:00
OFFSET ( VCPU_XER , kvm_vcpu , arch . regs . xer ) ;
OFFSET ( VCPU_LR , kvm_vcpu , arch . regs . link ) ;
OFFSET ( VCPU_CTR , kvm_vcpu , arch . regs . ctr ) ;
OFFSET ( VCPU_PC , kvm_vcpu , arch . regs . nip ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_SPRG9 , kvm_vcpu , arch . sprg9 ) ;
OFFSET ( VCPU_LAST_INST , kvm_vcpu , arch . last_inst ) ;
OFFSET ( VCPU_FAULT_DEAR , kvm_vcpu , arch . fault_dear ) ;
OFFSET ( VCPU_FAULT_ESR , kvm_vcpu , arch . fault_esr ) ;
OFFSET ( VCPU_CRIT_SAVE , kvm_vcpu , arch . crit_save ) ;
2010-04-16 02:11:42 +04:00
# endif /* CONFIG_PPC_BOOK3S */
2011-06-29 04:20:58 +04:00
# endif /* CONFIG_KVM */
2010-07-29 16:47:57 +04:00
# ifdef CONFIG_KVM_GUEST
2017-02-15 13:41:20 +03:00
OFFSET ( KVM_MAGIC_SCRATCH1 , kvm_vcpu_arch_shared , scratch1 ) ;
OFFSET ( KVM_MAGIC_SCRATCH2 , kvm_vcpu_arch_shared , scratch2 ) ;
OFFSET ( KVM_MAGIC_SCRATCH3 , kvm_vcpu_arch_shared , scratch3 ) ;
OFFSET ( KVM_MAGIC_INT , kvm_vcpu_arch_shared , int_pending ) ;
OFFSET ( KVM_MAGIC_MSR , kvm_vcpu_arch_shared , msr ) ;
OFFSET ( KVM_MAGIC_CRITICAL , kvm_vcpu_arch_shared , critical ) ;
OFFSET ( KVM_MAGIC_SR , kvm_vcpu_arch_shared , sr ) ;
2010-07-29 16:47:57 +04:00
# endif
2008-12-11 04:55:41 +03:00
# ifdef CONFIG_44x
DEFINE ( PGD_T_LOG2 , PGD_T_LOG2 ) ;
DEFINE ( PTE_T_LOG2 , PTE_T_LOG2 ) ;
# endif
2009-10-17 03:48:40 +04:00
# ifdef CONFIG_PPC_FSL_BOOK3E
2010-05-13 23:38:21 +04:00
DEFINE ( TLBCAM_SIZE , sizeof ( struct tlbcam ) ) ;
2017-02-15 13:41:20 +03:00
OFFSET ( TLBCAM_MAS0 , tlbcam , MAS0 ) ;
OFFSET ( TLBCAM_MAS1 , tlbcam , MAS1 ) ;
OFFSET ( TLBCAM_MAS2 , tlbcam , MAS2 ) ;
OFFSET ( TLBCAM_MAS3 , tlbcam , MAS3 ) ;
OFFSET ( TLBCAM_MAS7 , tlbcam , MAS7 ) ;
2010-05-13 23:38:21 +04:00
# endif
2008-04-17 08:28:09 +04:00
2011-06-15 03:34:31 +04:00
# if defined(CONFIG_KVM) && defined(CONFIG_SPE)
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_EVR , kvm_vcpu , arch . evr [ 0 ] ) ;
OFFSET ( VCPU_ACC , kvm_vcpu , arch . acc ) ;
OFFSET ( VCPU_SPEFSCR , kvm_vcpu , arch . spefscr ) ;
OFFSET ( VCPU_HOST_SPEFSCR , kvm_vcpu , arch . host_spefscr ) ;
2011-06-15 03:34:31 +04:00
# endif
2011-12-20 19:34:43 +04:00
# ifdef CONFIG_KVM_BOOKE_HV
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_HOST_MAS4 , kvm_vcpu , arch . host_mas4 ) ;
OFFSET ( VCPU_HOST_MAS6 , kvm_vcpu , arch . host_mas6 ) ;
2011-12-20 19:34:43 +04:00
# endif
2017-04-05 10:54:56 +03:00
# ifdef CONFIG_KVM_XICS
DEFINE ( VCPU_XIVE_SAVED_STATE , offsetof ( struct kvm_vcpu ,
arch . xive_saved_state ) ) ;
DEFINE ( VCPU_XIVE_CAM_WORD , offsetof ( struct kvm_vcpu ,
arch . xive_cam_word ) ) ;
DEFINE ( VCPU_XIVE_PUSHED , offsetof ( struct kvm_vcpu , arch . xive_pushed ) ) ;
2018-01-12 05:37:16 +03:00
DEFINE ( VCPU_XIVE_ESC_ON , offsetof ( struct kvm_vcpu , arch . xive_esc_on ) ) ;
DEFINE ( VCPU_XIVE_ESC_RADDR , offsetof ( struct kvm_vcpu , arch . xive_esc_raddr ) ) ;
DEFINE ( VCPU_XIVE_ESC_VADDR , offsetof ( struct kvm_vcpu , arch . xive_esc_vaddr ) ) ;
2017-04-05 10:54:56 +03:00
# endif
2008-12-03 00:51:57 +03:00
# ifdef CONFIG_KVM_EXIT_TIMING
2017-02-15 13:41:20 +03:00
OFFSET ( VCPU_TIMING_EXIT_TBU , kvm_vcpu , arch . timing_exit . tv32 . tbu ) ;
OFFSET ( VCPU_TIMING_EXIT_TBL , kvm_vcpu , arch . timing_exit . tv32 . tbl ) ;
OFFSET ( VCPU_TIMING_LAST_ENTER_TBU , kvm_vcpu , arch . timing_last_enter . tv32 . tbu ) ;
OFFSET ( VCPU_TIMING_LAST_ENTER_TBL , kvm_vcpu , arch . timing_last_enter . tv32 . tbl ) ;
2008-12-03 00:51:57 +03:00
# endif
2014-12-09 21:56:52 +03:00
# ifdef CONFIG_PPC_POWERNV
2017-02-15 13:41:20 +03:00
OFFSET ( PACA_CORE_IDLE_STATE_PTR , paca_struct , core_idle_state_ptr ) ;
OFFSET ( PACA_THREAD_IDLE_STATE , paca_struct , thread_idle_state ) ;
OFFSET ( PACA_THREAD_MASK , paca_struct , thread_mask ) ;
OFFSET ( PACA_SUBCORE_SIBLING_MASK , paca_struct , subcore_sibling_mask ) ;
2017-05-16 11:49:47 +03:00
OFFSET ( PACA_REQ_PSSCR , paca_struct , requested_psscr ) ;
powerpc/powernv: Provide a way to force a core into SMT4 mode
POWER9 processors up to and including "Nimbus" v2.2 have hardware
bugs relating to transactional memory and thread reconfiguration.
One of these bugs has a workaround which is to get the core into
SMT4 state temporarily. This workaround is only needed when
running bare-metal.
This patch provides a function which gets the core into SMT4 mode
by preventing threads from going to a stop state, and waking up
those which are already in a stop state. Once at least 3 threads
are not in a stop state, the core will be in SMT4 and we can
continue.
To do this, we add a "dont_stop" flag to the paca to tell the
thread not to go into a stop state. If this flag is set,
power9_idle_stop() just returns immediately with a return value
of 0. The pnv_power9_force_smt4_catch() function does the following:
1. Set the dont_stop flag for each thread in the core, except
ourselves (in fact we use an atomic_inc() in case more than
one thread is calling this function concurrently).
2. See how many threads are awake, indicated by their
requested_psscr field in the paca being 0. If this is at
least 3, skip to step 5.
3. Send a doorbell interrupt to each thread that was seen as
being in a stop state in step 2.
4. Until at least 3 threads are awake, scan the threads to which
we sent a doorbell interrupt and check if they are awake now.
This relies on the following properties:
- Once dont_stop is non-zero, requested_psscr can't go from zero to
non-zero, except transiently (and without the thread doing stop).
- requested_psscr being zero guarantees that the thread isn't in
a state-losing stop state where thread reconfiguration could occur.
- Doing stop with a PSSCR value of 0 won't be a state-losing stop
and thus won't allow thread reconfiguration.
- Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
must be in SMT4 mode, since SMT modes are powers of 2.
This does add a sync to power9_idle_stop(), which is necessary to
provide the correct ordering between setting requested_psscr and
checking dont_stop. The overhead of the sync should be unnoticeable
compared to the latency of going into and out of a stop state.
Because some objected to incurring this extra latency on systems where
the XER[SO] bug is not relevant, I have put the test in
power9_idle_stop inside a feature section. This means that
pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
probably hang the system.
In order to cater for uses where the caller has an operation that
has to be done while the core is in SMT4, the core continues to be
kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
until the pnv_power9_force_smt4_release() function is called.
It undoes the effect of step 1 above and allows the other threads
to go into a stop state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
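The catch/release sequence described in the commit message above can be sketched as a single-threaded simulation. This is only an illustration under stated assumptions: sim_thread, sim_doorbell, sim_force_smt4_catch and sim_force_smt4_release are hypothetical names, the doorbell is modelled as waking its target immediately, and real kernel code would additionally need atomic updates and the memory ordering the message discusses.

```c
#include <assert.h>

#define SIM_THREADS_PER_CORE 4	/* assumption: an SMT4-capable core */

/* Hypothetical stand-in for the two per-thread paca fields named above. */
struct sim_thread {
	int dont_stop;			/* non-zero: do not enter a stop state */
	unsigned long requested_psscr;	/* 0 means the thread is awake */
};

/* Hypothetical doorbell: in this simulation the target wakes at once. */
static void sim_doorbell(struct sim_thread *t)
{
	t->requested_psscr = 0;
}

/* Follows steps 1-4 of the message; returns the number of doorbells sent. */
static int sim_force_smt4_catch(struct sim_thread *core, int self)
{
	const int need = SIM_THREADS_PER_CORE / 2 + 1;	/* i.e. 3 */
	int asleep[SIM_THREADS_PER_CORE] = { 0 };
	int awake = 0, sent = 0, i;

	assert(self >= 0 && self < SIM_THREADS_PER_CORE);

	/* Step 1: forbid every other thread from entering a stop state. */
	for (i = 0; i < SIM_THREADS_PER_CORE; i++)
		if (i != self)
			core[i].dont_stop++;

	/* Step 2: count awake threads (requested_psscr == 0). */
	for (i = 0; i < SIM_THREADS_PER_CORE; i++) {
		if (core[i].requested_psscr == 0)
			awake++;
		else
			asleep[i] = 1;
	}
	if (awake >= need)
		return 0;

	/* Step 3: send a doorbell to each thread seen in a stop state. */
	for (i = 0; i < SIM_THREADS_PER_CORE; i++) {
		if (asleep[i]) {
			sim_doorbell(&core[i]);
			sent++;
		}
	}

	/* Step 4: rescan the poked threads until enough are awake. */
	while (awake < need) {
		for (i = 0; i < SIM_THREADS_PER_CORE; i++) {
			if (asleep[i] && core[i].requested_psscr == 0) {
				asleep[i] = 0;
				awake++;
			}
		}
	}
	return sent;
}

/* Undo step 1 so the other threads may enter a stop state again. */
static void sim_force_smt4_release(struct sim_thread *core, int self)
{
	int i;

	for (i = 0; i < SIM_THREADS_PER_CORE; i++)
		if (i != self)
			core[i].dont_stop--;
}
```

The counter (rather than a boolean) for dont_stop mirrors the message's remark that more than one thread may call the catch function concurrently; each caller's release then only removes its own contribution.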
2018-03-21 13:32:00 +03:00
OFFSET ( PACA_DONT_STOP , paca_struct , dont_stop ) ;
powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle
The stop4 idle state on POWER9 is a deep idle state which loses
hypervisor resources, but whose latency is low enough that it can be
exposed via cpuidle.
Until now, the deep idle states which lose hypervisor resources (e.g.
winkle) were only exposed via CPU-Hotplug. Hence, on wakeup from such
states, only a few SPRs are currently restored to their older values;
the rest of the SPRs are reinitialized to their boot-time values.
When stop4 is used in the context of cpuidle, we want these additional
SPRs to be restored to their older values, to ensure that the context
on the CPU coming back from idle is the same as it was before going idle.
In this patch, we define a SPR save area in PACA (since we have used
up the volatile register space in the stack) and on POWER9, we restore
SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
SPRN_MMCR2 to the values they had before entering stop.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
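As a rough illustration of the save area described above, the sketch below shows a per-CPU structure with a nested SPR block and how an offsetof-based macro yields a field offset that assembly code could use. The names sim_stop_sprs, sim_paca, SIM_STOP_SPR_OFFSET, sim_save_sprs and sim_restore_sprs are hypothetical, and the layout is a simplified assumption, not the real paca_struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical save area holding the SPRs listed in the message above. */
struct sim_stop_sprs {
	uint64_t pid;
	uint64_t ldbar;
	uint64_t fscr;
	uint64_t hfscr;
	uint64_t mmcr1;
	uint64_t mmcr2;
	uint64_t mmcra;
};

/* Hypothetical, heavily simplified per-CPU area. */
struct sim_paca {
	uint64_t other_state;		/* placeholder for unrelated fields */
	struct sim_stop_sprs stop_sprs;
};

/* Offset of one saved SPR from the start of the per-CPU area. */
#define SIM_STOP_SPR_OFFSET(f) offsetof(struct sim_paca, stop_sprs.f)

/* Save the live values before entering the idle state. */
static void sim_save_sprs(struct sim_paca *p, const struct sim_stop_sprs *live)
{
	p->stop_sprs = *live;
}

/* Restore the saved values after wakeup. */
static void sim_restore_sprs(const struct sim_paca *p, struct sim_stop_sprs *live)
{
	*live = p->stop_sprs;
}
```

Keeping the saved registers in one nested struct lets a single helper macro generate every offset, which is the same shape as the STOP_SPR macro in this file.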
2017-07-21 13:41:37 +03:00
# define STOP_SPR(x, f) OFFSET(x, paca_struct, stop_sprs.f)
STOP_SPR ( STOP_PID , pid ) ;
STOP_SPR ( STOP_LDBAR , ldbar ) ;
STOP_SPR ( STOP_FSCR , fscr ) ;
STOP_SPR ( STOP_HFSCR , hfscr ) ;
STOP_SPR ( STOP_MMCR1 , mmcr1 ) ;
STOP_SPR ( STOP_MMCR2 , mmcr2 ) ;
STOP_SPR ( STOP_MMCRA , mmcra ) ;
2014-12-09 21:56:52 +03:00
# endif
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 06:21:12 +03:00
DEFINE ( PPC_DBELL_SERVER , PPC_DBELL_SERVER ) ;
2017-06-13 16:05:48 +03:00
DEFINE ( PPC_DBELL_MSGTYPE , PPC_DBELL_MSGTYPE ) ;
2015-03-28 06:21:12 +03:00
powerpc/8xx: Fix vaddr for IMMR early remap
Memory: 124428K/131072K available (3748K kernel code, 188K rwdata,
648K rodata, 508K init, 290K bss, 6644K reserved)
Kernel virtual memory layout:
* 0xfffdf000..0xfffff000 : fixmap
* 0xfde00000..0xfe000000 : consistent mem
* 0xfddf6000..0xfde00000 : early ioremap
* 0xc9000000..0xfddf6000 : vmalloc & ioremap
SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Today, IMMR is mapped 1:1 at startup.
Mapping IMMR 1:1 is just wrong because it may overlap with another
area. On most mpc8xx boards it is OK as IMMR is set to 0xff000000,
but for instance on the EP88xC board, IMMR is at 0xfa200000, which
overlaps with the VM ioremap area.
This patch fixes the virtual address for remapping IMMR with the fixmap,
regardless of the value of IMMR.
The size of the IMMR area is 256 kbytes (CPM at offset 0, security engine
at offset 128k), so a 512k page is enough.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
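The overlap argument above can be checked with a little arithmetic. The sketch below takes the vmalloc/ioremap range and the top of the fixmap range from the boot log quoted in the message, and assumes 4 KiB pages; the sim_* names are hypothetical, and sim_fix_to_virt only mirrors the usual "top minus index pages" fixmap computation rather than reproducing the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT		12		/* assumption: 4 KiB pages */
#define SIM_FIXADDR_TOP		0xfffff000u	/* top of the fixmap range in the log */
#define SIM_VMALLOC_START	0xc9000000u	/* "vmalloc & ioremap" range in the log */
#define SIM_VMALLOC_END		0xfddf6000u

/* Virtual address of fixmap slot idx, counting down from the top. */
static uint32_t sim_fix_to_virt(uint32_t idx)
{
	return SIM_FIXADDR_TOP - (idx << SIM_PAGE_SHIFT);
}

/* Non-zero if a 1:1 mapping [base, base + size) hits the vmalloc range. */
static int sim_overlaps_vmalloc(uint32_t base, uint32_t size)
{
	return base < SIM_VMALLOC_END && base + size > SIM_VMALLOC_START;
}
```

With these numbers, an IMMR at 0xfa200000 lands inside the vmalloc/ioremap range while 0xff000000 does not, which is why a fixed fixmap-derived address is safe regardless of where a board places IMMR.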
2016-05-17 10:02:43 +03:00
# ifdef CONFIG_PPC_8xx
2016-07-09 11:22:39 +03:00
DEFINE ( VIRT_IMMR_BASE , ( u64 ) __fix_to_virt ( FIX_IMMR_BASE ) ) ;
2016-05-17 10:02:43 +03:00
# endif
2005-09-26 10:04:21 +04:00
return 0 ;
}