/* SPDX-License-Identifier: GPL-2.0-only */
/*
 *
 * Copyright IBM Corp. 2008
 *
 * Authors: Hollis Blanchard <hollisb@us.ibm.com>
 */

#ifndef __POWERPC_KVM_PPC_H__
#define __POWERPC_KVM_PPC_H__

/* This file exists just so we can dereference kvm_vcpu, avoiding nested header
 * dependencies. */

#include <linux/mutex.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/kvm_types.h>
#include <linux/kvm_host.h>
#include <linux/bug.h>
#ifdef CONFIG_PPC_BOOK3S
#include <asm/kvm_book3s.h>
#else
#include <asm/kvm_booke.h>
#endif
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
#include <asm/paca.h>
#include <asm/xive.h>
#include <asm/cpu_has_feature.h>
#endif

/*
 * KVMPPC_INST_SW_BREAKPOINT is the debug instruction
 * used to support software breakpoints.
 */
#define KVMPPC_INST_SW_BREAKPOINT	0x00dddd00

enum emulation_result {
	EMULATE_DONE,		/* no further processing */
	EMULATE_DO_MMIO,	/* kvm_run filled with MMIO request */
	EMULATE_FAIL,		/* can't emulate this instruction */
	EMULATE_AGAIN,		/* something went wrong. go again */
	EMULATE_EXIT_USER,	/* emulation requires exit to user-space */
};

enum instruction_fetch_type {
	INST_GENERIC,
	INST_SC,		/* system call */
};

enum xlate_instdata {
	XLATE_INST,		/* translate instruction address */
	XLATE_DATA		/* translate data address */
};

enum xlate_readwrite {
	XLATE_READ,		/* check for read permissions */
	XLATE_WRITE		/* check for write permissions */
};

extern int kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
extern int __kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
extern void kvmppc_handler_highmem(void);

extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
extern int kvmppc_handle_load(struct kvm_vcpu *vcpu,
			      unsigned int rt, unsigned int bytes,
			      int is_default_endian);
extern int kvmppc_handle_loads(struct kvm_vcpu *vcpu,
			       unsigned int rt, unsigned int bytes,
			       int is_default_endian);
extern int kvmppc_handle_vsx_load(struct kvm_vcpu *vcpu,
				unsigned int rt, unsigned int bytes,
				int is_default_endian, int mmio_sign_extend);
extern int kvmppc_handle_vmx_load(struct kvm_vcpu *vcpu,
		unsigned int rt, unsigned int bytes, int is_default_endian);
extern int kvmppc_handle_vmx_store(struct kvm_vcpu *vcpu,
		unsigned int rs, unsigned int bytes, int is_default_endian);
extern int kvmppc_handle_store(struct kvm_vcpu *vcpu,
			       u64 val, unsigned int bytes,
			       int is_default_endian);
extern int kvmppc_handle_vsx_store(struct kvm_vcpu *vcpu,
				int rs, unsigned int bytes,
				int is_default_endian);

extern int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
				 enum instruction_fetch_type type, u32 *inst);

extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
		     bool data);
extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
		     bool data);
extern int kvmppc_emulate_instruction(struct kvm_vcpu *vcpu);
extern int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu);
extern int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu);
extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
extern void kvmppc_decrementer_func(struct kvm_vcpu *vcpu);
extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu);
extern void kvmppc_subarch_vcpu_uninit(struct kvm_vcpu *vcpu);

/* Core-specific hooks */

extern void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 gvaddr, gpa_t gpaddr,
			   unsigned int gtlb_idx);
extern void kvmppc_mmu_priv_switch(struct kvm_vcpu *vcpu, int usermode);
extern void kvmppc_mmu_switch_pid(struct kvm_vcpu *vcpu, u32 pid);
extern int kvmppc_mmu_dtlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
extern int kvmppc_mmu_itlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
extern gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int gtlb_index,
			      gva_t eaddr);
extern void kvmppc_mmu_dtlb_miss(struct kvm_vcpu *vcpu);
extern void kvmppc_mmu_itlb_miss(struct kvm_vcpu *vcpu);
extern int kvmppc_xlate(struct kvm_vcpu *vcpu, ulong eaddr,
			enum xlate_instdata xlid, enum xlate_readwrite xlrw,
			struct kvmppc_pte *pte);

extern int kvmppc_core_vcpu_create(struct kvm_vcpu *vcpu);
extern void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu);
extern int kvmppc_core_vcpu_setup(struct kvm_vcpu *vcpu);
extern int kvmppc_core_check_processor_compat(void);
extern int kvmppc_core_vcpu_translate(struct kvm_vcpu *vcpu,
				      struct kvm_translation *tr);

extern void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
extern void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu);
extern int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu);
extern int kvmppc_core_pending_dec(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_machine_check(struct kvm_vcpu *vcpu, ulong flags);
extern void kvmppc_core_queue_syscall(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_program(struct kvm_vcpu *vcpu, ulong flags);
extern void kvmppc_core_queue_fpunavail(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_vec_unavail(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_vsx_unavail(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_dec(struct kvm_vcpu *vcpu);
extern void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
				       struct kvm_interrupt *irq);
extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_dtlb_miss(struct kvm_vcpu *vcpu, ulong dear_flags,
					ulong esr_flags);
extern void kvmppc_core_queue_data_storage(struct kvm_vcpu *vcpu,
					   ulong dear_flags,
					   ulong esr_flags);
extern void kvmppc_core_queue_itlb_miss(struct kvm_vcpu *vcpu);
extern void kvmppc_core_queue_inst_storage(struct kvm_vcpu *vcpu,
					   ulong esr_flags);
extern void kvmppc_core_flush_tlb(struct kvm_vcpu *vcpu);
extern int kvmppc_core_check_requests(struct kvm_vcpu *vcpu);

extern int kvmppc_booke_init(void);
extern void kvmppc_booke_exit(void);

extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);

extern int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order);
extern void kvmppc_set_hpt(struct kvm *kvm, struct kvm_hpt_info *info);
extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, int order);
extern void kvmppc_free_hpt(struct kvm_hpt_info *info);
extern void kvmppc_rmap_reset(struct kvm *kvm);
extern long kvmppc_prepare_vrma(struct kvm *kvm,
				struct kvm_userspace_memory_region *mem);
extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
			struct kvm_memory_slot *memslot, unsigned long porder);
extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
		struct iommu_group *grp);
extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
		struct iommu_group *grp);
extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
extern void kvmppc_setup_partition_table(struct kvm *kvm);

2011-06-29 04:22:41 +04:00
extern long kvm_vm_ioctl_create_spapr_tce ( struct kvm * kvm ,
2016-03-01 09:54:40 +03:00
struct kvm_create_spapr_tce_64 * args ) ;
2016-02-15 04:55:09 +03:00
extern struct kvmppc_spapr_tce_table * kvmppc_find_table (
2017-03-22 07:21:53 +03:00
struct kvm * kvm , unsigned long liobn ) ;
2017-03-22 07:21:55 +03:00
#define kvmppc_ioba_validate(stt, ioba, npages)                         \
		(iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
				(stt)->size, (ioba), (npages)) ?        \
				H_PARAMETER : H_SUCCESS)
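As a rough illustration of what kvmppc_ioba_validate() maps onto hcall return codes, the sketch below pairs the same macro shape with a stand-in range check. `sketch_check_ioba()` is an assumption about the kind of test iommu_tce_check_ioba() performs (alignment plus window bounds), not a copy of it, and the struct and constants are hypothetical.

```c
#include <stdint.h>

#define SKETCH_H_SUCCESS    0L
#define SKETCH_H_PARAMETER  (-4L)

struct sketch_tce_table {
	unsigned int page_shift;  /* log2 of the IOMMU page size */
	uint64_t offset;          /* first page number of the DMA window */
	uint64_t size;            /* window length in pages */
};

/* Assumed stand-in for iommu_tce_check_ioba(): nonzero means "bad range". */
static int sketch_check_ioba(unsigned int page_shift, uint64_t offset,
			     uint64_t size, uint64_t ioba, uint64_t npages)
{
	uint64_t mask = (UINT64_C(1) << page_shift) - 1;
	uint64_t first;

	if (ioba & mask)
		return -1;                 /* not aligned to the page size */
	first = ioba >> page_shift;
	if (first < offset)
		return -1;                 /* starts before the window */
	if (npages > size || first - offset > size - npages)
		return -1;                 /* runs past the end of the window */
	return 0;
}

/* Same shape as the macro above: any failure becomes H_PARAMETER. */
#define sketch_ioba_validate(stt, ioba, npages)                          \
	(sketch_check_ioba((stt)->page_shift, (stt)->offset,             \
			   (stt)->size, (ioba), (npages)) ?              \
	 SKETCH_H_PARAMETER : SKETCH_H_SUCCESS)
```

The bounds test is written with subtractions so that it cannot overflow for large `npages`.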
2012-03-16 01:58:34 +04:00
extern long kvmppc_h_put_tce ( struct kvm_vcpu * vcpu , unsigned long liobn ,
unsigned long ioba , unsigned long tce ) ;
2016-02-15 04:55:09 +03:00
extern long kvmppc_h_put_tce_indirect ( struct kvm_vcpu * vcpu ,
unsigned long liobn , unsigned long ioba ,
unsigned long tce_list , unsigned long npages ) ;
extern long kvmppc_h_stuff_tce ( struct kvm_vcpu * vcpu ,
unsigned long liobn , unsigned long ioba ,
unsigned long tce_value , unsigned long npages ) ;
2014-02-21 19:31:10 +04:00
extern long kvmppc_h_get_tce ( struct kvm_vcpu * vcpu , unsigned long liobn ,
unsigned long ioba ) ;
2016-12-20 08:48:59 +03:00
extern struct page * kvm_alloc_hpt_cma ( unsigned long nr_pages ) ;
extern void kvm_free_hpt_cma ( struct page * page , unsigned long nr_pages ) ;
2011-06-29 04:19:22 +04:00
extern int kvmppc_core_init_vm ( struct kvm * kvm ) ;
extern void kvmppc_core_destroy_vm ( struct kvm * kvm ) ;
2013-10-07 20:48:00 +04:00
extern void kvmppc_core_free_memslot ( struct kvm * kvm ,
2020-02-19 00:07:27 +03:00
struct kvm_memory_slot * slot ) ;
2011-06-29 04:19:22 +04:00
extern int kvmppc_core_prepare_memory_region ( struct kvm * kvm ,
2021-12-06 22:54:11 +03:00
const struct kvm_memory_slot * old ,
struct kvm_memory_slot * new ,
2020-02-19 00:07:18 +03:00
enum kvm_mr_change change ) ;
2011-06-29 04:19:22 +04:00
extern void kvmppc_core_commit_memory_region ( struct kvm * kvm ,
2021-12-06 22:54:11 +03:00
struct kvm_memory_slot * old ,
2018-12-12 07:15:30 +03:00
const struct kvm_memory_slot * new ,
enum kvm_mr_change change ) ;
2012-04-26 23:43:42 +04:00
extern int kvm_vm_ioctl_get_smmu_info ( struct kvm * kvm ,
struct kvm_ppc_smmu_info * info ) ;
2012-09-11 17:28:18 +04:00
extern void kvmppc_core_flush_memslot ( struct kvm * kvm ,
struct kvm_memory_slot * memslot ) ;
2011-06-29 04:19:22 +04:00
2011-12-20 19:34:43 +04:00
extern int kvmppc_bookehv_init ( void ) ;
extern void kvmppc_bookehv_exit ( void ) ;
2012-08-10 14:28:50 +04:00
extern int kvmppc_prepare_to_enter ( struct kvm_vcpu * vcpu ) ;
KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on
this fd return the contents of the HPT (hashed page table), writes
create and/or remove entries in the HPT. There is a new capability,
KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl
takes an argument structure with the index of the first HPT entry to
read out and a set of flags. The flags indicate whether the user is
intending to read or write the HPT, and whether to return all entries
or only the "bolted" entries (those with the bolted bit, 0x10, set in
the first doubleword).
This is intended for use in implementing qemu's savevm/loadvm and for
live migration. Therefore, on reads, the first pass returns information
about all HPTEs (or all bolted HPTEs). When the first pass reaches the
end of the HPT, it returns from the read. Subsequent reads only return
information about HPTEs that have changed since they were last read.
A read that finds no changed HPTEs in the HPT following where the last
read finished will return 0 bytes.
The format of the data provides a simple run-length compression of the
invalid entries. Each block of data starts with a header that indicates
the index (position in the HPT, which is just an array), the number of
valid entries starting at that index (may be zero), and the number of
invalid entries following those valid entries. The valid entries, 16
bytes each, follow the header. The invalid entries are not explicitly
represented.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-20 02:57:20 +04:00
extern int kvm_vm_ioctl_get_htab_fd ( struct kvm * kvm , struct kvm_get_htab_fd * ) ;
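The stream format described in the commit message above (a header carrying the index, the number of valid entries and the number of invalid entries, followed by 16 bytes per valid entry) could be walked in userspace roughly as follows. The header field widths and every `sketch_*` name are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the per-block header (8 bytes, no padding). */
struct sketch_htab_header {
	uint32_t index;      /* HPT index of the first entry in this block */
	uint16_t n_valid;    /* valid entries that follow the header */
	uint16_t n_invalid;  /* invalid entries implied after the valid ones */
};

#define SKETCH_HPTE_SIZE 16  /* bytes per valid entry, per the message */

struct sketch_htab_totals {
	size_t blocks;
	size_t valid;
	size_t invalid;
	uint32_t next_index;  /* index just past the last block seen */
};

/* Walk one buffer as returned by read(); -1 means the data is truncated. */
static int sketch_htab_walk(const uint8_t *buf, size_t len,
			    struct sketch_htab_totals *t)
{
	size_t pos = 0;

	memset(t, 0, sizeof(*t));
	while (pos < len) {
		struct sketch_htab_header hdr;
		size_t payload;

		if (len - pos < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + pos, sizeof(hdr));
		pos += sizeof(hdr);

		/* Only valid entries carry data; invalid ones are implied. */
		payload = (size_t)hdr.n_valid * SKETCH_HPTE_SIZE;
		if (len - pos < payload)
			return -1;
		pos += payload;

		t->blocks++;
		t->valid += hdr.n_valid;
		t->invalid += hdr.n_invalid;
		t->next_index = hdr.index + hdr.n_valid + hdr.n_invalid;
	}
	return 0;
}
```

A migration tool would call something like this on each buffer until a read returns 0 bytes, which the message defines as "no changed HPTEs after the last position".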
2016-12-20 08:49:05 +03:00
extern long kvm_vm_ioctl_resize_hpt_prepare ( struct kvm * kvm ,
struct kvm_ppc_resize_hpt * rhpt ) ;
extern long kvm_vm_ioctl_resize_hpt_commit ( struct kvm * kvm ,
struct kvm_ppc_resize_hpt * rhpt ) ;
2013-04-12 18:08:46 +04:00
int kvm_vcpu_ioctl_interrupt ( struct kvm_vcpu * vcpu , struct kvm_interrupt * irq ) ;
2013-04-18 00:30:00 +04:00
extern int kvm_vm_ioctl_rtas_define_token ( struct kvm * kvm , void __user * argp ) ;
extern int kvmppc_rtas_hcall ( struct kvm_vcpu * vcpu ) ;
extern void kvmppc_rtas_tokens_free ( struct kvm * kvm ) ;
2017-04-05 10:54:56 +03:00
2013-04-18 00:30:26 +04:00
extern int kvmppc_xics_set_xive ( struct kvm * kvm , u32 irq , u32 server ,
u32 priority ) ;
extern int kvmppc_xics_get_xive ( struct kvm * kvm , u32 irq , u32 * server ,
u32 * priority ) ;
2013-04-18 00:32:04 +04:00
extern int kvmppc_xics_int_on ( struct kvm * kvm , u32 irq ) ;
extern int kvmppc_xics_int_off ( struct kvm * kvm , u32 irq ) ;
2013-04-18 00:30:00 +04:00
2014-08-13 13:09:44 +04:00
void kvmppc_core_dequeue_debug ( struct kvm_vcpu * vcpu ) ;
void kvmppc_core_queue_debug ( struct kvm_vcpu * vcpu ) ;
2013-10-07 20:47:53 +04:00
union kvmppc_one_reg {
u32 wval ;
u64 dval ;
vector128 vval ;
u64 vsxval [ 2 ] ;
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 16:12:36 +03:00
u32 vsx32val [ 4 ] ;
2018-05-21 08:24:26 +03:00
u16 vsx16val [ 8 ] ;
u8 vsx8val [ 16 ] ;
2013-10-07 20:47:53 +04:00
struct {
u64 addr ;
u64 length ;
} vpaval ;
2019-04-18 13:39:35 +03:00
u64 xive_timaval [ 2 ] ;
2013-10-07 20:47:53 +04:00
} ;
struct kvmppc_ops {
2013-10-07 20:48:01 +04:00
struct module * owner ;
2013-10-07 20:47:53 +04:00
int ( * get_sregs ) ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
int ( * set_sregs ) ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
int ( * get_one_reg ) ( struct kvm_vcpu * vcpu , u64 id ,
union kvmppc_one_reg * val ) ;
int ( * set_one_reg ) ( struct kvm_vcpu * vcpu , u64 id ,
union kvmppc_one_reg * val ) ;
void ( * vcpu_load ) ( struct kvm_vcpu * vcpu , int cpu ) ;
void ( * vcpu_put ) ( struct kvm_vcpu * vcpu ) ;
2019-10-02 09:00:22 +03:00
void ( * inject_interrupt ) ( struct kvm_vcpu * vcpu , int vec , u64 srr1_flags ) ;
2013-10-07 20:47:53 +04:00
void ( * set_msr ) ( struct kvm_vcpu * vcpu , u64 msr ) ;
2020-04-27 07:35:11 +03:00
int ( * vcpu_run ) ( struct kvm_vcpu * vcpu ) ;
2019-12-19 00:55:00 +03:00
int ( * vcpu_create ) ( struct kvm_vcpu * vcpu ) ;
2013-10-07 20:47:53 +04:00
void ( * vcpu_free ) ( struct kvm_vcpu * vcpu ) ;
int ( * check_requests ) ( struct kvm_vcpu * vcpu ) ;
int ( * get_dirty_log ) ( struct kvm * kvm , struct kvm_dirty_log * log ) ;
void ( * flush_memslot ) ( struct kvm * kvm , struct kvm_memory_slot * memslot ) ;
int ( * prepare_memory_region ) ( struct kvm * kvm ,
2021-12-06 22:54:11 +03:00
const struct kvm_memory_slot * old ,
struct kvm_memory_slot * new ,
2020-02-19 00:07:18 +03:00
enum kvm_mr_change change ) ;
2013-10-07 20:47:53 +04:00
void ( * commit_memory_region ) ( struct kvm * kvm ,
2021-12-06 22:54:11 +03:00
struct kvm_memory_slot * old ,
2018-12-12 07:15:30 +03:00
const struct kvm_memory_slot * new ,
enum kvm_mr_change change ) ;
2021-04-02 03:56:53 +03:00
bool ( * unmap_gfn_range ) ( struct kvm * kvm , struct kvm_gfn_range * range ) ;
bool ( * age_gfn ) ( struct kvm * kvm , struct kvm_gfn_range * range ) ;
bool ( * test_age_gfn ) ( struct kvm * kvm , struct kvm_gfn_range * range ) ;
bool ( * set_spte_gfn ) ( struct kvm * kvm , struct kvm_gfn_range * range ) ;
2020-02-19 00:07:27 +03:00
void ( * free_memslot ) ( struct kvm_memory_slot * slot ) ;
2013-10-07 20:47:53 +04:00
int ( * init_vm ) ( struct kvm * kvm ) ;
void ( * destroy_vm ) ( struct kvm * kvm ) ;
int ( * get_smmu_info ) ( struct kvm * kvm , struct kvm_ppc_smmu_info * info ) ;
2020-04-27 07:35:11 +03:00
int ( * emulate_op ) ( struct kvm_vcpu * vcpu ,
2013-10-07 20:47:53 +04:00
unsigned int inst , int * advance ) ;
int ( * emulate_mtspr ) ( struct kvm_vcpu * vcpu , int sprn , ulong spr_val ) ;
int ( * emulate_mfspr ) ( struct kvm_vcpu * vcpu , int sprn , ulong * spr_val ) ;
void ( * fast_vcpu_kick ) ( struct kvm_vcpu * vcpu ) ;
long ( * arch_vm_ioctl ) ( struct file * filp , unsigned int ioctl ,
unsigned long arg ) ;
2014-06-02 05:03:00 +04:00
int ( * hcall_implemented ) ( unsigned long hcall ) ;
2016-08-19 08:35:47 +03:00
int ( * irq_bypass_add_producer ) ( struct irq_bypass_consumer * ,
struct irq_bypass_producer * ) ;
void ( * irq_bypass_del_producer ) ( struct irq_bypass_consumer * ,
struct irq_bypass_producer * ) ;
2017-01-30 13:21:41 +03:00
int ( * configure_mmu ) ( struct kvm * kvm , struct kvm_ppc_mmuv3_cfg * cfg ) ;
int ( * get_rmmu_info ) ( struct kvm * kvm , struct kvm_ppc_rmmu_info * info ) ;
KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode
This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core. Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering. This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core. With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.
On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore. On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-06 05:24:41 +03:00
int ( * set_smt_mode ) ( struct kvm * kvm , unsigned long mode ,
unsigned long flags ) ;
2018-05-21 08:24:22 +03:00
void ( * giveup_ext ) ( struct kvm_vcpu * vcpu , ulong msr ) ;
2018-09-21 13:02:01 +03:00
int ( * enable_nested ) ( struct kvm * kvm ) ;
2018-12-14 08:29:06 +03:00
int ( * load_from_eaddr ) ( struct kvm_vcpu * vcpu , ulong * eaddr , void * ptr ,
int size ) ;
int ( * store_to_eaddr ) ( struct kvm_vcpu * vcpu , ulong * eaddr , void * ptr ,
int size ) ;
2020-03-19 07:29:55 +03:00
int ( * enable_svm ) ( struct kvm * kvm ) ;
2019-11-25 06:06:30 +03:00
int ( * svm_off ) ( struct kvm * kvm ) ;
2020-12-16 13:42:19 +03:00
int ( * enable_dawr1 ) ( struct kvm * kvm ) ;
2021-02-05 19:41:54 +03:00
bool ( * hash_v3_possible ) ( void ) ;
2022-01-11 03:54:04 +03:00
int ( * create_vm_debugfs ) ( struct kvm * kvm ) ;
int ( * create_vcpu_debugfs ) ( struct kvm_vcpu * vcpu , struct dentry * debugfs_dentry ) ;
2013-10-07 20:47:53 +04:00
} ;
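The constraint that the set_smt_mode commit message places on the virtual SMT mode can be written as a small predicate. This is a sketch only: the power-of-2 requirement is stated in the message for the "loose" POWER9 case and is assumed here for the "strict" case as well, and the function names are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_MAX_SMT_THREADS 8

static bool sketch_is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * loose_threading: true for a POWER9-style "loose" mode, false for a
 * POWER8-style "strict" mode where the hardware subcore size is a limit.
 */
static bool sketch_smt_mode_ok(unsigned long mode, bool loose_threading,
			       unsigned long threads_per_subcore)
{
	if (!sketch_is_pow2(mode) || mode > SKETCH_MAX_SMT_THREADS)
		return false;
	if (!loose_threading && mode > threads_per_subcore)
		return false;
	return true;
}
```

With a mode of 1 accepted, an SMT1 guest can use contiguous vcpu numbers instead of the sparse numbering the message describes as the previous workaround.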
2013-10-07 20:48:01 +04:00
extern struct kvmppc_ops * kvmppc_hv_ops ;
extern struct kvmppc_ops * kvmppc_pr_ops ;
2013-10-07 20:47:53 +04:00
2014-07-23 20:06:21 +04:00
static inline int kvmppc_get_last_inst ( struct kvm_vcpu * vcpu ,
2018-05-21 08:24:21 +03:00
enum instruction_fetch_type type , u32 * inst )
2014-07-23 20:06:21 +04:00
{
	int ret = EMULATE_DONE;
	u32 fetched_inst;
	/* Load the instruction manually if it failed to do so in the
	 * exit path */
	if (vcpu->arch.last_inst == KVM_INST_FETCH_FAILED)
		ret = kvmppc_load_last_inst(vcpu, type, &vcpu->arch.last_inst);
	/* Write fetch_failed unswapped if the fetch failed */
	if (ret == EMULATE_DONE)
		fetched_inst = kvmppc_need_byteswap(vcpu) ?
				swab32(vcpu->arch.last_inst) :
				vcpu->arch.last_inst;
	else
		fetched_inst = vcpu->arch.last_inst;
	*inst = fetched_inst;
	return ret;
}
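The control flow of kvmppc_get_last_inst() (reload on a failed fetch, then byteswap only if the fetch succeeded) can be exercised outside the kernel with a sketch like the one below. The vcpu struct, the reload helper and the constant values are hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EMULATE_DONE   0
#define SKETCH_EMULATE_AGAIN  1
#define SKETCH_FETCH_FAILED   UINT32_C(0xffffffff)  /* assumed sentinel */

struct sketch_vcpu {
	uint32_t last_inst;     /* instruction as seen in guest memory */
	bool need_byteswap;     /* guest endianness differs from the host */
	bool reload_ok;         /* whether the manual reload will succeed */
	uint32_t reload_value;  /* what the manual reload would read */
};

static uint32_t sketch_swab32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Stand-in for the manual reload done when the exit path could not fetch. */
static int sketch_load_last_inst(struct sketch_vcpu *v)
{
	if (!v->reload_ok)
		return SKETCH_EMULATE_AGAIN;
	v->last_inst = v->reload_value;
	return SKETCH_EMULATE_DONE;
}

static int sketch_get_last_inst(struct sketch_vcpu *v, uint32_t *inst)
{
	int ret = SKETCH_EMULATE_DONE;

	if (v->last_inst == SKETCH_FETCH_FAILED)
		ret = sketch_load_last_inst(v);

	/* Swap only a successfully fetched instruction. */
	if (ret == SKETCH_EMULATE_DONE && v->need_byteswap)
		*inst = sketch_swab32(v->last_inst);
	else
		*inst = v->last_inst;
	return ret;
}
```

Leaving the failure sentinel unswapped means a caller can still compare the returned value against the sentinel whatever the guest's endianness.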
2013-10-07 20:48:02 +04:00
static inline bool is_kvmppc_hv_enabled(struct kvm *kvm)
{
	return kvm->arch.kvm_ops == kvmppc_hv_ops;
}
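The pattern behind struct kvmppc_ops and is_kvmppc_hv_enabled() is a per-VM pointer to one of several operation tables, with the backend identified by pointer comparison. A minimal sketch, with hypothetical `sketch_*` types and only one callback, might look like this:

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_vm;

struct sketch_ops {
	const char *name;
	int (*init_vm)(struct sketch_vm *vm);
};

struct sketch_vm {
	const struct sketch_ops *ops;  /* backend chosen at VM creation */
	int flavour;                   /* set by the backend, for illustration */
};

static int sketch_hv_init_vm(struct sketch_vm *vm) { vm->flavour = 1; return 0; }
static int sketch_pr_init_vm(struct sketch_vm *vm) { vm->flavour = 2; return 0; }

static const struct sketch_ops sketch_hv_table = { "hv", sketch_hv_init_vm };
static const struct sketch_ops sketch_pr_table = { "pr", sketch_pr_init_vm };

/* Global pointers, analogous in spirit to kvmppc_hv_ops / kvmppc_pr_ops. */
static const struct sketch_ops *sketch_hv_ops = &sketch_hv_table;
static const struct sketch_ops *sketch_pr_ops = &sketch_pr_table;

/* Generic entry point: record the backend, then dispatch through it. */
static int sketch_core_init_vm(struct sketch_vm *vm,
			       const struct sketch_ops *ops)
{
	if (ops == NULL || ops->init_vm == NULL)
		return -1;
	vm->ops = ops;
	return ops->init_vm(vm);
}

static bool sketch_is_hv_enabled(const struct sketch_vm *vm)
{
	return vm->ops == sketch_hv_ops;
}
```

Comparing pointers rather than a type tag keeps the check to a single load and compare, and it needs no extra field in the ops structure.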
2015-03-20 12:39:41 +03:00
extern int kvmppc_hwrng_present ( void ) ;
2010-02-19 13:00:42 +03:00
/*
 * Cuts out inst bits with ordering according to spec.
 * That means the leftmost bit is zero. All given bits are included.
 */
static inline u32 kvmppc_get_field(u64 inst, int msb, int lsb)
{
	u32 r;
	u32 mask;
	BUG_ON(msb > lsb);
	mask = (1 << (lsb - msb + 1)) - 1;
	r = (inst >> (63 - lsb)) & mask;
	return r;
}
/*
 * Replaces inst bits with ordering according to spec.
 */
static inline u32 kvmppc_set_field(u64 inst, int msb, int lsb, int value)
{
	u32 r;
	u32 mask;
	BUG_ON(msb > lsb);
	mask = ((1 << (lsb - msb + 1)) - 1) << (63 - lsb);
	r = (inst & ~mask) | ((value << (63 - lsb)) & mask);
	return r;
}
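A worked example may help with the bit numbering used by kvmppc_get_field() and kvmppc_set_field(): bit 0 is the leftmost bit of the 64-bit value, so a 32-bit instruction held in the low word occupies bits 32 to 63. The sketch below reimplements the two helpers with hypothetical names and 64-bit intermediates, assuming msb <= lsb and a field narrower than 64 bits.

```c
#include <stdint.h>

/* Extract bits msb..lsb (inclusive), where bit 0 is the leftmost of 64. */
static uint32_t sketch_get_field(uint64_t inst, int msb, int lsb)
{
	uint64_t mask = (UINT64_C(1) << (lsb - msb + 1)) - 1;

	return (uint32_t)((inst >> (63 - lsb)) & mask);
}

/* Replace bits msb..lsb with value; the result is truncated to 32 bits. */
static uint32_t sketch_set_field(uint64_t inst, int msb, int lsb,
				 uint32_t value)
{
	uint64_t mask = ((UINT64_C(1) << (lsb - msb + 1)) - 1) << (63 - lsb);

	return (uint32_t)((inst & ~mask) |
			  (((uint64_t)value << (63 - lsb)) & mask));
}
```

For the instruction word 0x7c0802a6, bits 32 to 37 hold the primary opcode, so `sketch_get_field(0x7c0802a6, 32, 37)` yields 31. Writing 3 into bits 38 to 42 with `sketch_set_field(0x7c0802a6, 38, 42, 3)` gives 0x7c6802a6.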
2012-09-26 00:31:56 +04:00
#define one_reg_size(id)	\
	(1ul << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
#define get_reg_val(id, reg)	({		\
	union kvmppc_one_reg __u;		\
	switch (one_reg_size(id)) {		\
	case 4: __u.wval = (reg); break;	\
	case 8: __u.dval = (reg); break;	\
	default: BUG();				\
	}					\
	__u;					\
})
#define set_reg_val(id, val)	({		\
	u64 __v;				\
	switch (one_reg_size(id)) {		\
	case 4: __v = (val).wval; break;	\
	case 8: __v = (val).dval; break;	\
	default: BUG();				\
	}					\
	__v;					\
})
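The one_reg_size() macro decodes a register's byte size from a field inside its 64-bit ID: the field stores log2 of the size, so the macro shifts 1 left by that amount. The sketch below shows the arithmetic with locally defined constants. Their values are assumed to mirror the KVM register ID encoding, and the function form of get_reg_val() returns an error where the macro would call BUG().

```c
#include <stdint.h>

/* Assumed to mirror the KVM register ID encoding (size field at bit 52). */
#define SKETCH_REG_SIZE_SHIFT 52
#define SKETCH_REG_SIZE_MASK  UINT64_C(0x00f0000000000000)
#define SKETCH_REG_SIZE_U32   UINT64_C(0x0020000000000000)
#define SKETCH_REG_SIZE_U64   UINT64_C(0x0030000000000000)
#define SKETCH_REG_SIZE_U128  UINT64_C(0x0040000000000000)

union sketch_one_reg {
	uint32_t wval;
	uint64_t dval;
};

/* Size in bytes encoded in a register ID. */
static uint64_t sketch_one_reg_size(uint64_t id)
{
	return UINT64_C(1) << ((id & SKETCH_REG_SIZE_MASK) >>
			       SKETCH_REG_SIZE_SHIFT);
}

/* Function form of get_reg_val(); -1 replaces the macro's BUG(). */
static int sketch_get_reg_val(uint64_t id, uint64_t reg,
			      union sketch_one_reg *out)
{
	switch (sketch_one_reg_size(id)) {
	case 4:
		out->wval = (uint32_t)reg;
		return 0;
	case 8:
		out->dval = reg;
		return 0;
	default:
		return -1;  /* wider registers need a dedicated union member */
	}
}
```

Registers wider than 8 bytes are outside what the two macros handle, which is why union kvmppc_one_reg carries separate vector members.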
2013-10-07 20:47:53 +04:00
int kvmppc_core_get_sregs ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
2011-04-28 02:24:21 +04:00
int kvmppc_core_set_sregs ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
2013-10-07 20:47:53 +04:00
int kvmppc_get_sregs_ivor ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
2011-04-28 02:24:21 +04:00
int kvmppc_set_sregs_ivor ( struct kvm_vcpu * vcpu , struct kvm_sregs * sregs ) ;
2011-12-12 16:26:50 +04:00
int kvm_vcpu_ioctl_get_one_reg ( struct kvm_vcpu * vcpu , struct kvm_one_reg * reg ) ;
int kvm_vcpu_ioctl_set_one_reg ( struct kvm_vcpu * vcpu , struct kvm_one_reg * reg ) ;
2012-09-26 00:31:56 +04:00
int kvmppc_get_one_reg ( struct kvm_vcpu * vcpu , u64 id , union kvmppc_one_reg * ) ;
int kvmppc_set_one_reg ( struct kvm_vcpu * vcpu , u64 id , union kvmppc_one_reg * ) ;
2011-12-12 16:26:50 +04:00
2011-04-28 02:24:21 +04:00
void kvmppc_set_pid ( struct kvm_vcpu * vcpu , u32 pid ) ;
2013-04-12 18:08:46 +04:00
struct openpic ;
2013-10-07 20:47:52 +04:00
# ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
2013-07-02 09:45:16 +04:00
extern void kvm_cma_reserve ( void ) __init ;
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:23:08 +04:00
static inline void kvmppc_set_xics_phys ( int cpu , unsigned long addr )
{
2018-02-13 18:08:12 +03:00
	paca_ptrs[cpu]->kvm_hstate.xics_phys = (void __iomem *)addr;
}
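Two rules from the SMT commit message above lend themselves to a short sketch: the vcore number is the vcpu number divided by the host's threads per core, and a vcore may run only when none of its vcpus is busy in the host. The enum and function names below are hypothetical, and `threads_per_core` is assumed to be nonzero.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_vcpu_state {
	SKETCH_RUNNABLE,      /* ready to execute the guest */
	SKETCH_BLOCKED,       /* idle */
	SKETCH_BUSY_IN_HOST   /* executing elsewhere in the kernel or in qemu */
};

/* vcore number = vcpu number / threads per core (integer division). */
static unsigned int sketch_vcore_id(unsigned int vcpu_id,
				    unsigned int threads_per_core)
{
	return vcpu_id / threads_per_core;
}

/* vcpu number to use for the n-th vcpu of a single-threaded guest. */
static unsigned int sketch_single_thread_vcpu_id(unsigned int n,
						 unsigned int threads_per_core)
{
	return n * threads_per_core;
}

/* Policy from the message: every vcpu must be runnable or blocked. */
static bool sketch_vcore_can_run(const enum sketch_vcpu_state *states,
				 size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (states[i] == SKETCH_BUSY_IN_HOST)
			return false;
	}
	return true;
}
```

With 4 threads per core, vcpus 0 to 3 share vcore 0, while a single-threaded guest that numbers its vcpus 0, 4, 8 gets one vcpu per vcore.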
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:25:44 +04:00
2017-04-05 10:54:56 +03:00
static inline void kvmppc_set_xive_tima ( int cpu ,
unsigned long phys_addr ,
void __iomem * virt_addr )
{
2018-02-13 18:08:12 +03:00
	paca_ptrs[cpu]->kvm_hstate.xive_tima_phys = (void __iomem *)phys_addr;
	paca_ptrs[cpu]->kvm_hstate.xive_tima_virt = virt_addr;
}
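The real-mode translation that the RMA commit message describes for RMO-only processors (compare the address against a limit, then OR it with the offset from the RMOR) reduces to a few lines. The function below is a sketch with hypothetical names. It models only the arithmetic and assumes the RMA base is aligned to the RMA size, so that the OR cannot disturb the offset bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Translate a guest real-mode address.
 * rmor:  base of the RMA in host real memory (assumed size-aligned)
 * limit: size of the RMA in bytes
 * Returns false if the access falls outside the RMA.
 */
static bool sketch_rmo_translate(uint64_t addr, uint64_t rmor,
				 uint64_t limit, uint64_t *real)
{
	if (addr >= limit)
		return false;
	*real = addr | rmor;
	return true;
}
```

Because the hardware ORs rather than adds, the alignment requirement is what makes the result equal to base plus offset. This is also why each guest needs a physically contiguous, aligned area.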
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
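As an illustration of how userspace might interpret the capability value, here is a small sketch. The enum and the function name are hypothetical, and treating any other value (including 0) as "no pool available" is an assumption.

```c
/* Hypothetical classification of the KVM_CAP_PPC_RMA capability value. */
enum rma_policy {
	RMA_UNAVAILABLE,	/* assumed meaning of any other value */
	RMA_OPTIONAL,		/* value 1: usable, but VRMA is also supported */
	RMA_REQUIRED,		/* value 2: every guest needs an RMA */
};

static enum rma_policy rma_policy_from_cap(int cap_value)
{
	switch (cap_value) {
	case 1:
		return RMA_OPTIONAL;
	case 2:
		return RMA_REQUIRED;
	default:
		return RMA_UNAVAILABLE;
	}
}
```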
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
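The per-vcpu MER handling can be pictured with the C sketch below. The real logic lives in assembly; the function name and the bit value used for LPCR_MER_SKETCH are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define LPCR_MER_SKETCH	0x800ull

/*
 * Hypothetical helper: start from the per-VM LPCR value and set MER only
 * when this vcpu has an external interrupt pending.
 */
static uint64_t guest_entry_lpcr(uint64_t vm_lpcr, bool ext_irq_pending)
{
	if (ext_irq_pending)
		return vm_lpcr | LPCR_MER_SKETCH;
	return vm_lpcr & ~LPCR_MER_SKETCH;
}
```

Keeping the shared value in one place and deriving the single varying bit at entry time is what lets the LPCR move from vcpu->arch to kvm->arch.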
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:25:44 +04:00
2013-04-18 00:30:50 +04:00
static inline u32 kvmppc_get_xics_latch(void)
{
2013-10-07 20:47:56 +04:00
	u32 xirr;
2013-04-18 00:30:50 +04:00
2013-10-07 20:47:56 +04:00
	xirr = get_paca()->kvm_hstate.saved_xirr;
2013-04-18 00:30:50 +04:00
	get_paca()->kvm_hstate.saved_xirr = 0;
	return xirr;
}
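The function above is a read-and-clear latch: it hands back the saved value once and leaves zero behind. A userspace sketch of the same pattern, with a hypothetical struct hstate_sketch standing in for the per-CPU PACA state:

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU kvm_hstate area. */
struct hstate_sketch {
	uint32_t saved_xirr;
};

/* Return the latched value and reset the latch to zero. */
static uint32_t get_xics_latch_sketch(struct hstate_sketch *hs)
{
	uint32_t xirr = hs->saved_xirr;

	hs->saved_xirr = 0;
	return xirr;
}
```

A second call without an intervening store returns 0, so a caller can tell "nothing latched" from a pending value.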
KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:
guest A:
- 224GB of memory
- 56 VCPUs (sockets=1,cores=28,threads=2), where:
VCPUs 0-1 are pinned to CPUs 0-3,
VCPUs 2-3 are pinned to CPUs 4-7,
...
VCPUs 54-55 are pinned to CPUs 108-111
guest B:
- 4GB of memory
- 4 VCPUs (sockets=1,cores=4,threads=1)
with the following workloads (with KSM and THP enabled in all):
guest A:
stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
guest B:
stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
host:
stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):
[ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
[ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
[ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
[ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
[ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
[ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
[ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
[ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
[ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
[ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
[ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
[ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
[ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
[ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
[ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
[ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
[ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
[ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.
CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:
# CPU 105 registers (via xmon)
R00 = c00000000020b20c R16 = 00007d1bcd800000
R01 = c00000363eaa7970 R17 = 0000000000000001
R02 = c0000000019b3a00 R18 = 000000000000006b
R03 = 000000000000002a R19 = 00007d537d7aecf0
R04 = 000000000000002a R20 = 60000000000000e0
R05 = 000000000000002a R21 = 0801000000000080
R06 = c0002073fb0caa08 R22 = 0000000000000d60
R07 = c0000000019ddd78 R23 = 0000000000000001
R08 = 000000000000002a R24 = c00000000147a700
R09 = 0000000000000001 R25 = c0002073fb0ca908
R10 = c000008ffeb4e660 R26 = 0000000000000000
R11 = c0002073fb0ca900 R27 = c0000000019e2464
R12 = c000000000050790 R28 = c0000000000812b0
R13 = c000207fff623e00 R29 = c0002073fb0ca808
R14 = 00007d1bbee00000 R30 = c0002073fb0ca800
R15 = 00007d1bcd600000 R31 = 0000000000000800
pc = c00000000020b260 smp_call_function_many+0x3d0/0x460
cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
lr = c00000000020b20c smp_call_function_many+0x37c/0x460
msr = 900000010288b033 cr = 44024824
ctr = c000000000050790 xer = 0000000000000000 trap = 100
CPU 42 is running normally, doing VCPU work:
# CPU 42 stack trace (via xmon)
[link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
[c000008ed3343820] c000008ed3343850 (unreliable)
[c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
[c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
[c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
[c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
[c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
[c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
[c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
--- Exception: c00 (System Call) at 00007d545cfd7694
SP (7d53ff7edf50) is in userspace
It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.
Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):
      CPU
        X: smp_muxed_ipi_set_message():
        X:   smp_mb()
        X:   message[RESCHEDULE] = 1
        X: doorbell_global_ipi(42):
        X:   kvmppc_set_host_ipi(42, 1)
        X:   ppc_msgsnd_sync()/smp_mb()
        X:   ppc_msgsnd() -> 42
       42: doorbell_exception(): // from CPU X
       42:   ppc_msgsync()
      105: smp_muxed_ipi_set_message():
      105:   smp_mb()
           // STORE DEFERRED DUE TO RE-ORDERING
    --105:   message[CALL_FUNCTION] = 1
    | 105: doorbell_global_ipi(42):
    | 105:   kvmppc_set_host_ipi(42, 1)
    |  42:   kvmppc_set_host_ipi(42, 0)
    |  42: smp_ipi_demux_relaxed()
    |  42: // returns to executing guest
    |      // RE-ORDERED STORE COMPLETES
    ->105:   message[CALL_FUNCTION] = 1
      105:   ppc_msgsnd_sync()/smp_mb()
      105:   ppc_msgsnd() -> 42
       42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
      105: // hangs waiting on 42 to process messages/call_single_queue
This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
state indicated by the host_ipi flag) are ordered vs. the store to
host_ipi.
However, doing so might still allow for the following scenario (not
yet observed):
      CPU
        X: smp_muxed_ipi_set_message():
        X:   smp_mb()
        X:   message[RESCHEDULE] = 1
        X: doorbell_global_ipi(42):
        X:   kvmppc_set_host_ipi(42, 1)
        X:   ppc_msgsnd_sync()/smp_mb()
        X:   ppc_msgsnd() -> 42
       42: doorbell_exception(): // from CPU X
       42:   ppc_msgsync()
           // STORE DEFERRED DUE TO RE-ORDERING
    -- 42:   kvmppc_set_host_ipi(42, 0)
    |  42: smp_ipi_demux_relaxed()
    | 105: smp_muxed_ipi_set_message():
    | 105:   smp_mb()
    | 105:   message[CALL_FUNCTION] = 1
    | 105: doorbell_global_ipi(42):
    | 105:   kvmppc_set_host_ipi(42, 1)
    |      // RE-ORDERED STORE COMPLETES
    -> 42:   kvmppc_set_host_ipi(42, 0)
       42: // returns to executing guest
      105:   ppc_msgsnd_sync()/smp_mb()
      105:   ppc_msgsnd() -> 42
       42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
      105: // hangs waiting on 42 to process messages/call_single_queue
Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.
To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.
With that change in place the above workload ran for 20 hours without
triggering any lock-ups.
Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
2019-09-12 01:31:55 +03:00
/*
 * To avoid the need to unnecessarily exit fully to the host kernel, an IPI to
 * a CPU thread that's running/napping inside of a guest is by default regarded
 * as a request to wake the CPU (if needed) and continue execution within the
 * guest, potentially to process new state like externally-generated
 * interrupts or IPIs sent from within the guest itself (e.g. H_PROD/H_IPI).
 *
 * To force an exit to the host kernel, kvmppc_set_host_ipi() must be called
 * prior to issuing the IPI to set the corresponding 'host_ipi' flag in the
 * target CPU's PACA. To avoid unnecessary exits to the host, this flag should
 * be immediately cleared via kvmppc_clear_host_ipi() by the IPI handler on
 * the receiving side prior to processing the IPI work.
 *
 * NOTE:
 *
 * We currently issue an smp_mb() at the beginning of kvmppc_set_host_ipi().
 * This is to guard against sequences such as the following:
 *
 *      CPU
 *        X: smp_muxed_ipi_set_message():
 *        X:   smp_mb()
 *        X:   message[RESCHEDULE] = 1
 *        X: doorbell_global_ipi(42):
 *        X:   kvmppc_set_host_ipi(42)
 *        X:   ppc_msgsnd_sync()/smp_mb()
 *        X:   ppc_msgsnd() -> 42
 *       42: doorbell_exception(): // from CPU X
 *       42:   ppc_msgsync()
 *      105: smp_muxed_ipi_set_message():
 *      105:   smp_mb()
 *           // STORE DEFERRED DUE TO RE-ORDERING
 *    --105:   message[CALL_FUNCTION] = 1
 *    | 105: doorbell_global_ipi(42):
 *    | 105:   kvmppc_set_host_ipi(42)
 *    |  42:   kvmppc_clear_host_ipi(42)
 *    |  42: smp_ipi_demux_relaxed()
 *    |  42: // returns to executing guest
 *    |      // RE-ORDERED STORE COMPLETES
 *    ->105:   message[CALL_FUNCTION] = 1
 *      105:   ppc_msgsnd_sync()/smp_mb()
 *      105:   ppc_msgsnd() -> 42
 *       42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
 *      105: // hangs waiting on 42 to process messages/call_single_queue
 *
 * We also issue an smp_mb() at the end of kvmppc_clear_host_ipi(). This is
 * to guard against sequences such as the following (as well as to create
 * a read-side pairing with the barrier in kvmppc_set_host_ipi()):
 *
 *      CPU
 *        X: smp_muxed_ipi_set_message():
 *        X:   smp_mb()
 *        X:   message[RESCHEDULE] = 1
 *        X: doorbell_global_ipi(42):
 *        X:   kvmppc_set_host_ipi(42)
 *        X:   ppc_msgsnd_sync()/smp_mb()
 *        X:   ppc_msgsnd() -> 42
 *       42: doorbell_exception(): // from CPU X
 *       42:   ppc_msgsync()
 *           // STORE DEFERRED DUE TO RE-ORDERING
 *    -- 42:   kvmppc_clear_host_ipi(42)
 *    |  42: smp_ipi_demux_relaxed()
 *    | 105: smp_muxed_ipi_set_message():
 *    | 105:   smp_mb()
 *    | 105:   message[CALL_FUNCTION] = 1
 *    | 105: doorbell_global_ipi(42):
 *    | 105:   kvmppc_set_host_ipi(42)
 *    |      // RE-ORDERED STORE COMPLETES
 *    -> 42:   kvmppc_clear_host_ipi(42)
 *       42: // returns to executing guest
 *      105:   ppc_msgsnd_sync()/smp_mb()
 *      105:   ppc_msgsnd() -> 42
 *       42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
 *      105: // hangs waiting on 42 to process messages/call_single_queue
 */
static inline void kvmppc_set_host_ipi(int cpu)
{
	/*
	 * order stores of IPI messages vs. setting of host_ipi flag
	 *
	 * pairs with the barrier in kvmppc_clear_host_ipi()
	 */
	smp_mb();
	paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
}
static inline void kvmppc_clear_host_ipi(int cpu)
2013-04-18 00:30:50 +04:00
{
2019-09-12 01:31:55 +03:00
	paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
	/*
	 * order clearing of host_ipi flag vs. processing of IPI messages
	 *
	 * pairs with the barrier in kvmppc_set_host_ipi()
	 */
	smp_mb();
2013-04-18 00:30:50 +04:00
}
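The set/clear pairing can be approximated in portable C11, using atomic_thread_fence(memory_order_seq_cst) as a stand-in for smp_mb(). This is a sketch under stated assumptions: struct cpu_sketch, send_host_ipi() and handle_host_ipi() are hypothetical names, and the doorbell itself is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state: the flag plus one pending-message word. */
struct cpu_sketch {
	atomic_int host_ipi;	/* stands in for kvm_hstate.host_ipi */
	atomic_int message;	/* stands in for the IPI message slot */
};

/* Sender: publish the message, full barrier, then set the flag. */
static void send_host_ipi(struct cpu_sketch *target)
{
	atomic_store_explicit(&target->message, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier before set */
	atomic_store_explicit(&target->host_ipi, 1, memory_order_relaxed);
	/* a real sender would ring the doorbell here */
}

/* Receiver: clear the flag, full barrier, then consume the message. */
static bool handle_host_ipi(struct cpu_sketch *self)
{
	if (!atomic_load_explicit(&self->host_ipi, memory_order_relaxed))
		return false;	/* not a host IPI: stay in the guest */
	atomic_store_explicit(&self->host_ipi, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier after clear */
	return atomic_exchange_explicit(&self->message, 0,
					memory_order_relaxed) != 0;
}
```

With both fences in place, a receiver that observes the flag set also observes the message store that preceded it, and a sender that sets the flag after the receiver's clear has its message picked up by a later pass rather than lost.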
2013-10-07 20:47:53 +04:00
static inline void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu)
{
2013-10-07 20:48:01 +04:00
	vcpu->kvm->arch.kvm_ops->fast_vcpu_kick(vcpu);
2013-10-07 20:47:53 +04:00
}
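The wrapper above forwards the kick through a per-VM table of function pointers. A self-contained sketch of that dispatch pattern follows; all type and function names here are hypothetical simplifications, not the kernel's definitions.

```c
struct vcpu_sketch;

/* Hypothetical ops table: one backend fills in its own implementation. */
struct kvm_ops_sketch {
	void (*fast_vcpu_kick)(struct vcpu_sketch *vcpu);
};

struct vm_sketch {
	const struct kvm_ops_sketch *ops;
};

struct vcpu_sketch {
	struct vm_sketch *vm;
	int kicks;	/* counts kicks so the example is observable */
};

/* Example backend implementation. */
static void backend_fast_vcpu_kick(struct vcpu_sketch *vcpu)
{
	vcpu->kicks++;
}

/* Generic wrapper: callers never need to know which backend is active. */
static void fast_vcpu_kick_sketch(struct vcpu_sketch *vcpu)
{
	vcpu->vm->ops->fast_vcpu_kick(vcpu);
}
```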
2011-06-29 04:25:44 +04:00
2014-05-23 12:15:25 +04:00
extern void kvm_hv_vm_activated(void);
extern void kvm_hv_vm_deactivated(void);
extern bool kvm_hv_mode_active(void);
2021-11-23 12:52:20 +03:00
extern void kvmppc_check_need_tlb_flush(struct kvm *kvm, int pcpu);
2019-04-29 12:00:40 +03:00
2011-06-29 04:23:08 +04:00
#else
2013-07-02 09:45:16 +04:00
static inline void __init kvm_cma_reserve(void)
{}
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
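The numbering rule above can be sketched in a few lines of C. This is
only an illustration: the helper names are hypothetical and not part of
the patch, and threads_per_core stands for the value reported through
KVM_CAP_PPC_SMT.

```c
#include <assert.h>

/* Hypothetical helper: map a vcpu number to its virtual core number. */
static inline unsigned int vcore_of_vcpu(unsigned int vcpu_id,
					 unsigned int threads_per_core)
{
	assert(threads_per_core > 0);
	return vcpu_id / threads_per_core;
}

/*
 * Hypothetical helper: vcpu number for the n-th vcpu of a guest that
 * should run single-threaded. Using multiples of threads_per_core puts
 * every vcpu in its own vcore.
 */
static inline unsigned int single_threaded_vcpu_id(unsigned int n,
						   unsigned int threads_per_core)
{
	assert(threads_per_core > 0);
	return n * threads_per_core;
}
```

With 4 threads per core, vcpus 4..7 share vcore 1, while a
single-threaded guest would use vcpu numbers 0, 4, 8 and so on.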
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
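A minimal sketch of that policy check follows. The enum and function
names are made up for illustration and do not claim to match the names
used in the patch; the sketch covers only the stated condition that no
thread may be busy in the host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three vcpu states described above (hypothetical names). */
enum vcpu_state {
	VCPU_RUNNABLE,		/* ready to execute the guest */
	VCPU_BLOCKED,		/* idle */
	VCPU_BUSY_IN_HOST,	/* executing elsewhere in the kernel or in qemu */
};

/*
 * The vcore may run only when every one of its threads is runnable or
 * blocked, i.e. when none of them is busy in the host.
 */
static bool vcore_may_run(const enum vcpu_state *threads, size_t n)
{
	size_t i;

	assert(threads != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (threads[i] == VCPU_BUSY_IN_HOST)
			return false;
	}
	return true;
}
```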
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:23:08 +04:00
static inline void kvmppc_set_xics_phys ( int cpu , unsigned long addr )
{ }
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 04:25:44 +04:00
2017-04-05 10:54:56 +03:00
static inline void kvmppc_set_xive_tima ( int cpu ,
unsigned long phys_addr ,
void __iomem * virt_addr )
{ }
2013-04-18 00:30:50 +04:00
static inline u32 kvmppc_get_xics_latch ( void )
{
return 0 ;
}
KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:
guest A:
- 224GB of memory
- 56 VCPUs (sockets=1,cores=28,threads=2), where:
VCPUs 0-1 are pinned to CPUs 0-3,
VCPUs 2-3 are pinned to CPUs 4-7,
...
VCPUs 54-55 are pinned to CPUs 108-111
guest B:
- 4GB of memory
- 4 VCPUs (sockets=1,cores=4,threads=1)
with the following workloads (with KSM and THP enabled in all):
guest A:
stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
guest B:
stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
host:
stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):
[ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
[ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
[ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
[ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
[ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
[ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
[ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
[ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
[ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
[ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
[ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
[ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
[ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
[ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
[ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
[ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
[ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
[ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.
CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:
# CPU 105 registers (via xmon)
R00 = c00000000020b20c R16 = 00007d1bcd800000
R01 = c00000363eaa7970 R17 = 0000000000000001
R02 = c0000000019b3a00 R18 = 000000000000006b
R03 = 000000000000002a R19 = 00007d537d7aecf0
R04 = 000000000000002a R20 = 60000000000000e0
R05 = 000000000000002a R21 = 0801000000000080
R06 = c0002073fb0caa08 R22 = 0000000000000d60
R07 = c0000000019ddd78 R23 = 0000000000000001
R08 = 000000000000002a R24 = c00000000147a700
R09 = 0000000000000001 R25 = c0002073fb0ca908
R10 = c000008ffeb4e660 R26 = 0000000000000000
R11 = c0002073fb0ca900 R27 = c0000000019e2464
R12 = c000000000050790 R28 = c0000000000812b0
R13 = c000207fff623e00 R29 = c0002073fb0ca808
R14 = 00007d1bbee00000 R30 = c0002073fb0ca800
R15 = 00007d1bcd600000 R31 = 0000000000000800
pc = c00000000020b260 smp_call_function_many+0x3d0/0x460
cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
lr = c00000000020b20c smp_call_function_many+0x37c/0x460
msr = 900000010288b033 cr = 44024824
ctr = c000000000050790 xer = 0000000000000000 trap = 100
CPU 42 is running normally, doing VCPU work:
# CPU 42 stack trace (via xmon)
[link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
[c000008ed3343820] c000008ed3343850 (unreliable)
[c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
[c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
[c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
[c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
[c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
[c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
[c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
--- Exception: c00 (System Call) at 00007d545cfd7694
SP (7d53ff7edf50) is in userspace
It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.
Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):
CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
105: smp_muxed_ipi_set_message():
105: smp_mb()
// STORE DEFERRED DUE TO RE-ORDERING
--105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 42: // returns to executing guest
| // RE-ORDERED STORE COMPLETES
->105: message[CALL_FUNCTION] = 1
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue
This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
state indicated by the host_ipi flag) are ordered vs. the store to
host_ipi.
However, doing so might still allow for the following scenario (not
yet observed):
CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
// STORE DEFERRED DUE TO RE-ORDERING
-- 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 105: smp_muxed_ipi_set_message():
| 105: smp_mb()
| 105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| // RE-ORDERED STORE COMPLETES
-> 42: kvmppc_set_host_ipi(42, 0)
42: // returns to executing guest
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue
Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.
To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.
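The pairing can be illustrated with a userspace analogue built on C11
atomics. This is a sketch under stated assumptions, not the kernel
code: the struct and the sketch_* names are hypothetical,
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), and a
single message field stands in for message[<type>].

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state: one IPI message slot plus the host_ipi flag. */
struct cpu_ipi_state {
	atomic_int message;
	atomic_int host_ipi;
};

/* Sender side: full barrier *before* setting the flag. */
static void sketch_set_host_ipi(struct cpu_ipi_state *s)
{
	assert(s != NULL);
	/* Order the earlier message store before the host_ipi store. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&s->host_ipi, 1, memory_order_relaxed);
}

/* Receiver side: full barrier *after* clearing the flag. */
static void sketch_clear_host_ipi(struct cpu_ipi_state *s)
{
	assert(s != NULL);
	atomic_store_explicit(&s->host_ipi, 0, memory_order_relaxed);
	/* Order the clear before the subsequent reads of the messages. */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Sender: publish the message, then raise the flag. */
static void sketch_send_ipi(struct cpu_ipi_state *s)
{
	atomic_store_explicit(&s->message, 1, memory_order_relaxed);
	sketch_set_host_ipi(s);
	/* A real sender would ring the doorbell here. */
}

/* Receiver: returns true if a message was found and consumed. */
static bool sketch_handle_ipi(struct cpu_ipi_state *s)
{
	if (!atomic_load_explicit(&s->host_ipi, memory_order_relaxed))
		return false;	/* flag clear: the IPI is ignored */
	sketch_clear_host_ipi(s);
	return atomic_exchange_explicit(&s->message, 0,
					memory_order_relaxed) != 0;
}
```

With both barriers in place, a receiver that sees the flag set also
sees the message, and a sender that sets the flag after the receiver
has cleared it cannot have its message lost behind that clear.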
With that change in place the above workload ran for 20 hours without
triggering any lock-ups.
Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
2019-09-12 01:31:55 +03:00
static inline void kvmppc_set_host_ipi ( int cpu )
{ }
static inline void kvmppc_clear_host_ipi ( int cpu )
2013-04-18 00:30:50 +04:00
{ }
static inline void kvmppc_fast_vcpu_kick ( struct kvm_vcpu * vcpu )
{
kvm_vcpu_kick ( vcpu ) ;
}
2014-05-23 12:15:25 +04:00
static inline bool kvm_hv_mode_active ( void ) { return false ; }
2013-04-18 00:30:26 +04:00
# endif
2022-04-03 15:12:52 +03:00
# ifdef CONFIG_PPC_PSERIES
static inline bool kvmhv_on_pseries ( void )
{
return ! cpu_has_feature ( CPU_FTR_HVMODE ) ;
}
# else
static inline bool kvmhv_on_pseries ( void )
{
return false ;
}
# endif
2013-04-18 00:30:26 +04:00
# ifdef CONFIG_KVM_XICS
static inline int kvmppc_xics_enabled ( struct kvm_vcpu * vcpu )
{
return vcpu -> arch . irq_type == KVMPPC_IRQ_XICS ;
}
2016-08-19 08:35:48 +03:00
static inline struct kvmppc_passthru_irqmap * kvmppc_get_passthru_irqmap (
struct kvm * kvm )
{
2016-08-19 08:35:54 +03:00
if ( kvm && kvm_irq_bypass )
2016-08-19 08:35:48 +03:00
return kvm -> arch . pimap ;
return NULL ;
}
2015-12-17 23:59:06 +03:00
extern void kvmppc_alloc_host_rm_ops ( void ) ;
extern void kvmppc_free_host_rm_ops ( void ) ;
2016-08-19 08:35:48 +03:00
extern void kvmppc_free_pimap ( struct kvm * kvm ) ;
2016-08-19 08:35:52 +03:00
extern int kvmppc_xics_rm_complete ( struct kvm_vcpu * vcpu , u32 hcall ) ;
2013-04-18 00:30:26 +04:00
extern void kvmppc_xics_free_icp ( struct kvm_vcpu * vcpu ) ;
extern int kvmppc_xics_hcall ( struct kvm_vcpu * vcpu , u32 cmd ) ;
KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path
In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path. This
requires some new handlers for H_CEDE and xics-on-xive to be added
before xive is pulled or cede logic is checked.
This introduces a change in radix guest behaviour where radix guests
that execute 'sc 1' in userspace now get a privilege fault whereas
previously the 'sc 1' would be reflected as a syscall interrupt to the
guest kernel. That reflection is only required for hash guests that run
PR KVM.
Background:
In POWER8 and earlier processors, it is very expensive to exit from the
HV real mode context of a guest hypervisor interrupt, and switch to host
virtual mode. On those processors, guest->HV interrupts reach the
hypervisor with the MMU off because the MMU is loaded with guest context
(LPCR, SDR1, SLB), and the other threads in the sub-core need to be
pulled out of the guest too. Then the primary must save off guest state,
invalidate SLB and ERAT, and load up host state before the MMU can be
enabled to run in host virtual mode (~= regular Linux mode).
Hash guests also require a lot of hcalls to run due to the nature of the
MMU architecture and paravirtualisation design. The XICS interrupt
controller requires hcalls to run.
So KVM traditionally tries hard to avoid the full exit, by handling
hcalls and other interrupts in real mode as much as possible.
By contrast, POWER9 has independent MMU context per-thread, and in radix
mode the hypervisor is in host virtual memory mode when the HV interrupt
is taken. Radix guests do not require significant hcalls to manage their
translations, and xive guests don't need hcalls to handle interrupts. So
it's much less important for performance to handle hcalls in real mode on
POWER9.
One caveat is that the TCE hcalls are performance critical; real-mode
variants were introduced for POWER8 in order to achieve 10GbE performance.
Real mode TCE hcalls were found to be less important on POWER9, which
was able to drive 40GbE networking without them (using the virt mode
hcalls) but performance is still important. These hcalls will benefit
from subsequent guest entry/exit optimisation including possibly a
faster "partial exit" that does not entirely switch to host context to
handle the hcall.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-14-npiggin@gmail.com
2021-05-28 12:07:33 +03:00
extern int kvmppc_xive_xics_hcall ( struct kvm_vcpu * vcpu , u32 req ) ;
2013-04-18 00:32:26 +04:00
extern u64 kvmppc_xics_get_icp ( struct kvm_vcpu * vcpu ) ;
extern int kvmppc_xics_set_icp ( struct kvm_vcpu * vcpu , u64 icpval ) ;
2013-04-27 04:28:37 +04:00
extern int kvmppc_xics_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu ) ;
2015-12-17 23:59:09 +03:00
extern void kvmppc_xics_ipi_action ( void ) ;
KVM: PPC: Book3S HV: Set server for passed-through interrupts
When a guest has a PCI pass-through device with an interrupt, it
will direct the interrupt to a particular guest VCPU. In fact the
physical interrupt might arrive on any CPU; it is then routed through
the emulated XICS (guest interrupt controller) and eventually
delivered to the target VCPU.
Now that we have code to handle device interrupts in real mode
without exiting to the host kernel, there is an advantage to having
the device interrupt arrive on the same (sub)core as the target
VCPU is running on. In this situation, the interrupt can be
delivered to the target VCPU without any exit to the host kernel
(using a hypervisor doorbell interrupt between threads if
necessary).
This patch aims to get passed-through device interrupts arriving
on the correct core by setting the interrupt server in the real
hardware XICS for the interrupt to the first thread in the (sub)core
where its target VCPU is running. We do this in the real-mode H_EOI
code because the H_EOI handler already needs to look at the
emulated ICS state for the interrupt (whereas the H_XIRR handler
doesn't), and we know we are running in the target VCPU context
at that point.
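As a rough illustration of "first thread in the (sub)core", the helper
below rounds a CPU number down to the start of its group of threads.
It is a sketch only: the function name is hypothetical, and it assumes
that the threads of a (sub)core have consecutive CPU numbers and that
every group has the same size.

```c
#include <assert.h>

/*
 * Hypothetical helper: CPU number of the first hardware thread in the
 * group (core or subcore) that contains cpu.
 */
static unsigned int first_thread_in_group(unsigned int cpu,
					  unsigned int threads_per_group)
{
	assert(threads_per_group > 0);
	return cpu - (cpu % threads_per_group);
}
```

For example, with 8 threads per core CPU 13 maps to CPU 8; if the core
is split into subcores of 4 threads, it maps to CPU 12 instead.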
We set the server CPU in hardware using an OPAL call, regardless of
what the IRQ affinity mask for the interrupt says, and without
updating the affinity mask. This amounts to saying that when an
interrupt is passed through to a guest, as a matter of policy we
allow the guest's affinity for the interrupt to override the host's.
This is inspired by an earlier patch from Suresh Warrier, although
none of this code came from that earlier patch.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-19 08:35:56 +03:00
extern void kvmppc_xics_set_mapped ( struct kvm * kvm , unsigned long guest_irq ,
unsigned long host_irq ) ;
extern void kvmppc_xics_clr_mapped ( struct kvm * kvm , unsigned long guest_irq ,
unsigned long host_irq ) ;
KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9
POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines. KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.
In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE. Access to the emulated
XICS is via OPAL calls. The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.
This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware. When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.
In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending. The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process. The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().
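The loop can be sketched as below. The mock interrupt source and the
sketch_/mock_ names are assumptions made so the example is
self-contained; only the shape of the loop and the rule for combining
return values follow the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock source: a list of return codes, one per pending interrupt. */
struct mock_intr_source {
	const long *rc;
	size_t n;
	size_t next;
};

/*
 * Stand-in for kvmppc_read_one_intr(): handles one interrupt and sets
 * *again when the EOI reports that another interrupt is pending.
 */
static long mock_read_one_intr(struct mock_intr_source *src, bool *again)
{
	long rc;

	assert(src != NULL && again != NULL);
	if (src->next >= src->n)
		return 0;
	rc = src->rc[src->next++];
	*again = src->next < src->n;
	return rc;
}

/* Keep reading until nothing is pending; return the largest non-zero rc. */
static long sketch_read_intr(struct mock_intr_source *src)
{
	long ret = 0;
	bool again;

	do {
		long rc;

		again = false;
		rc = mock_read_one_intr(src, &again);
		if (rc && (ret == 0 || rc > ret))
			ret = rc;
	} while (again);
	return ret;
}
```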
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-18 01:02:08 +03:00
extern long kvmppc_deliver_irq_passthru ( struct kvm_vcpu * vcpu , __be32 xirr ,
struct kvmppc_irq_map * irq_map ,
struct kvmppc_passthru_irqmap * pimap ,
bool * again ) ;
2017-04-05 10:54:56 +03:00
extern int kvmppc_xics_set_irq ( struct kvm * kvm , int irq_source_id , u32 irq ,
int level , bool line_status ) ;
2015-12-22 01:33:57 +03:00
extern int h_ipi_redirect ;
2013-04-18 00:30:26 +04:00
# else
2016-08-19 08:35:48 +03:00
static inline struct kvmppc_passthru_irqmap * kvmppc_get_passthru_irqmap (
struct kvm * kvm )
{ return NULL ; }
2021-01-25 12:53:38 +03:00
static inline void kvmppc_alloc_host_rm_ops ( void ) { }
static inline void kvmppc_free_host_rm_ops ( void ) { }
static inline void kvmppc_free_pimap ( struct kvm * kvm ) { }
2016-08-19 08:35:52 +03:00
static inline int kvmppc_xics_rm_complete ( struct kvm_vcpu * vcpu , u32 hcall )
{ return 0 ; }
2013-04-18 00:30:26 +04:00
static inline int kvmppc_xics_enabled ( struct kvm_vcpu * vcpu )
{ return 0 ; }
static inline void kvmppc_xics_free_icp ( struct kvm_vcpu * vcpu ) { }
static inline int kvmppc_xics_hcall ( struct kvm_vcpu * vcpu , u32 cmd )
{ return 0 ; }
2021-05-28 12:07:33 +03:00
static inline int kvmppc_xive_xics_hcall ( struct kvm_vcpu * vcpu , u32 req )
{ return 0 ; }
2011-06-29 04:23:08 +04:00
# endif
2017-04-05 10:54:56 +03:00
# ifdef CONFIG_KVM_XIVE
/*
* Below the first " xive " is the " eXternal Interrupt Virtualization Engine "
* ie . P9 new interrupt controller , while the second " xive " is the legacy
* " eXternal Interrupt Vector Entry " which is the configuration of an
* interrupt on the " xics " interrupt controller on P8 and earlier . Those
* two functions consume or produce a legacy " XIVE " state from the
* new " XIVE " interrupt controller .
*/
extern int kvmppc_xive_set_xive ( struct kvm * kvm , u32 irq , u32 server ,
u32 priority ) ;
extern int kvmppc_xive_get_xive ( struct kvm * kvm , u32 irq , u32 * server ,
u32 * priority ) ;
extern int kvmppc_xive_int_on ( struct kvm * kvm , u32 irq ) ;
extern int kvmppc_xive_int_off ( struct kvm * kvm , u32 irq ) ;
extern int kvmppc_xive_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu ) ;
extern void kvmppc_xive_cleanup_vcpu ( struct kvm_vcpu * vcpu ) ;
extern int kvmppc_xive_set_mapped ( struct kvm * kvm , unsigned long guest_irq ,
2021-07-01 16:27:32 +03:00
unsigned long host_irq ) ;
2017-04-05 10:54:56 +03:00
extern int kvmppc_xive_clr_mapped ( struct kvm * kvm , unsigned long guest_irq ,
2021-07-01 16:27:32 +03:00
unsigned long host_irq ) ;
2017-04-05 10:54:56 +03:00
extern u64 kvmppc_xive_get_icp ( struct kvm_vcpu * vcpu ) ;
extern int kvmppc_xive_set_icp ( struct kvm_vcpu * vcpu , u64 icpval ) ;
extern int kvmppc_xive_set_irq ( struct kvm * kvm , int irq_source_id , u32 irq ,
int level , bool line_status ) ;
2018-10-08 08:30:55 +03:00
extern void kvmppc_xive_push_vcpu ( struct kvm_vcpu * vcpu ) ;
2021-05-28 12:07:28 +03:00
extern void kvmppc_xive_pull_vcpu ( struct kvm_vcpu * vcpu ) ;
2022-03-03 08:33:12 +03:00
extern bool kvmppc_xive_rearm_escalation ( struct kvm_vcpu * vcpu ) ;
2019-04-18 13:39:27 +03:00
2019-04-18 13:39:28 +03:00
static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
}
extern int kvmppc_xive_native_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu ) ;
extern void kvmppc_xive_native_cleanup_vcpu ( struct kvm_vcpu * vcpu ) ;
2019-04-18 13:39:35 +03:00
extern int kvmppc_xive_native_get_vp ( struct kvm_vcpu * vcpu ,
union kvmppc_one_reg * val ) ;
extern int kvmppc_xive_native_set_vp ( struct kvm_vcpu * vcpu ,
union kvmppc_one_reg * val ) ;
2019-08-26 09:21:21 +03:00
extern bool kvmppc_xive_native_supported ( void ) ;
2019-04-18 13:39:27 +03:00
2017-04-05 10:54:56 +03:00
# else
static inline int kvmppc_xive_set_xive ( struct kvm * kvm , u32 irq , u32 server ,
u32 priority ) { return - 1 ; }
static inline int kvmppc_xive_get_xive ( struct kvm * kvm , u32 irq , u32 * server ,
u32 * priority ) { return - 1 ; }
static inline int kvmppc_xive_int_on ( struct kvm * kvm , u32 irq ) { return - 1 ; }
static inline int kvmppc_xive_int_off ( struct kvm * kvm , u32 irq ) { return - 1 ; }
static inline int kvmppc_xive_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu ) { return - EBUSY ; }
static inline void kvmppc_xive_cleanup_vcpu ( struct kvm_vcpu * vcpu ) { }
static inline int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
					 unsigned long host_irq) { return -ENODEV; }
static inline int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
					 unsigned long host_irq) { return -ENODEV; }
static inline u64 kvmppc_xive_get_icp ( struct kvm_vcpu * vcpu ) { return 0 ; }
static inline int kvmppc_xive_set_icp ( struct kvm_vcpu * vcpu , u64 icpval ) { return - ENOENT ; }
static inline int kvmppc_xive_set_irq ( struct kvm * kvm , int irq_source_id , u32 irq ,
int level , bool line_status ) { return - ENODEV ; }
2018-10-08 08:30:55 +03:00
static inline void kvmppc_xive_push_vcpu ( struct kvm_vcpu * vcpu ) { }
2021-05-28 12:07:28 +03:00
static inline void kvmppc_xive_pull_vcpu ( struct kvm_vcpu * vcpu ) { }
2022-03-03 08:33:12 +03:00
static inline bool kvmppc_xive_rearm_escalation ( struct kvm_vcpu * vcpu ) { return true ; }
2019-04-18 13:39:27 +03:00
2019-04-18 13:39:28 +03:00
static inline int kvmppc_xive_enabled ( struct kvm_vcpu * vcpu )
{ return 0 ; }
static inline int kvmppc_xive_native_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu ) { return - EBUSY ; }
static inline void kvmppc_xive_native_cleanup_vcpu ( struct kvm_vcpu * vcpu ) { }
2019-04-18 13:39:35 +03:00
static inline int kvmppc_xive_native_get_vp ( struct kvm_vcpu * vcpu ,
union kvmppc_one_reg * val )
{ return 0 ; }
static inline int kvmppc_xive_native_set_vp ( struct kvm_vcpu * vcpu ,
union kvmppc_one_reg * val )
{ return - ENOENT ; }
2019-04-18 13:39:27 +03:00
2017-04-05 10:54:56 +03:00
# endif /* CONFIG_KVM_XIVE */
2019-02-25 06:35:06 +03:00
# if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_KVM_BOOK3S_64_HANDLER)
KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE
Currently, the KVM code assumes that if the host kernel is using the
XIVE interrupt controller (the new interrupt controller that first
appeared in POWER9 systems), then the in-kernel XICS emulation will
use the XIVE hardware to deliver interrupts to the guest. However,
this only works when the host is running in hypervisor mode and has
full access to all of the XIVE functionality. It doesn't work in any
nested virtualization scenario, either with PR KVM or nested-HV KVM,
because the XICS-on-XIVE code calls directly into the native-XIVE
routines, which are not initialized and cannot function correctly
because they use OPAL calls, and OPAL is not available in a guest.
This means that using the in-kernel XICS emulation in a nested
hypervisor that is using XIVE as its interrupt controller will cause a
(nested) host kernel crash. To fix this, we change most of the places
where the current code calls xive_enabled() to select between the
XICS-on-XIVE emulation and the plain XICS emulation to call a new
function, xics_on_xive(), which returns false in a guest.
However, there is a further twist. The plain XICS emulation has some
functions which are used in real mode and access the underlying XICS
controller (the interrupt controller of the host) directly. In the
case of a nested hypervisor, this means doing XICS hypercalls
directly. When the nested host is using XIVE as its interrupt
controller, these hypercalls will fail. Therefore this also adds
checks in the places where the XICS emulation wants to access the
underlying interrupt controller directly, and if that is XIVE, makes
the code use the virtual mode fallback paths, which call generic
kernel infrastructure rather than doing direct XICS access.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
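The selection logic described in the message above can be sketched in user space as follows. This is an illustration under stated assumptions: `struct host_sketch` and its two flags are hypothetical stand-ins for what the kernel obtains from `xive_enabled()` and `cpu_has_feature(CPU_FTR_HVMODE)`, and the helper names ending in `_sketch` or starting with `pick_`/`can_` are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the host state the real checks consult. */
struct host_sketch {
	bool xive_enabled;	/* the (possibly nested) host uses XIVE */
	bool hv_mode;		/* running in hypervisor mode, i.e. not in a guest */
};

enum xics_backend { XICS_PLAIN, XICS_ON_XIVE };

/* Mirrors the shape of xics_on_xive(): false whenever we are a guest. */
static bool xics_on_xive_sketch(const struct host_sketch *h)
{
	assert(h);
	return h->xive_enabled && h->hv_mode;
}

/* Choose which in-kernel XICS emulation backs the guest. */
static enum xics_backend pick_xics_backend(const struct host_sketch *h)
{
	return xics_on_xive_sketch(h) ? XICS_ON_XIVE : XICS_PLAIN;
}

/*
 * The plain XICS emulation may touch the underlying controller directly
 * in real mode only when that controller is not XIVE; otherwise it has
 * to take the virtual-mode fallback path.
 */
static bool can_use_realmode_xics(const struct host_sketch *h)
{
	assert(h);
	return !h->xive_enabled;
}
```

For a nested hypervisor whose own interrupt controller is XIVE, the sketch picks the plain XICS emulation and reports that real-mode direct access is not usable, which matches the two halves of the fix described in the message.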
2019-02-04 14:07:20 +03:00
static inline bool xics_on_xive(void)
{
	return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
}
# else
static inline bool xics_on_xive ( void )
{
return false ;
}
# endif
2016-12-01 06:03:46 +03:00
/*
 * Prototypes for functions called only from assembler code.
 * Having prototypes reduces sparse errors.
 */
long kvmppc_rm_h_put_tce ( struct kvm_vcpu * vcpu , unsigned long liobn ,
unsigned long ioba , unsigned long tce ) ;
long kvmppc_rm_h_put_tce_indirect ( struct kvm_vcpu * vcpu ,
unsigned long liobn , unsigned long ioba ,
unsigned long tce_list , unsigned long npages ) ;
long kvmppc_rm_h_stuff_tce ( struct kvm_vcpu * vcpu ,
unsigned long liobn , unsigned long ioba ,
unsigned long tce_value , unsigned long npages ) ;
long int kvmppc_rm_h_confer ( struct kvm_vcpu * vcpu , int target ,
unsigned int yield_count ) ;
2021-05-28 12:07:44 +03:00
long kvmppc_rm_h_random ( struct kvm_vcpu * vcpu ) ;
2016-12-01 06:03:46 +03:00
void kvmhv_commence_exit ( int trap ) ;
KVM: PPC: Book3S HV: Simplify machine check handling
This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.
Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred. We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).
All that machine_check_queue_event() actually does is to cause the
event to be printed to the console. For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.
The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest. We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.
The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable. In that case we now always
synthesize a machine check interrupt for the guest. Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred. If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem. Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.
Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21 05:38:49 +03:00
void kvmppc_realmode_machine_check ( struct kvm_vcpu * vcpu ) ;
2016-12-01 06:03:46 +03:00
void kvmppc_subcore_enter_guest ( void ) ;
void kvmppc_subcore_exit_guest ( void ) ;
long kvmppc_realmode_hmi_handler ( void ) ;
2021-11-23 12:52:31 +03:00
long kvmppc_p9_realmode_hmi_handler ( struct kvm_vcpu * vcpu ) ;
2016-12-01 06:03:46 +03:00
long kvmppc_h_enter ( struct kvm_vcpu * vcpu , unsigned long flags ,
long pte_index , unsigned long pteh , unsigned long ptel ) ;
long kvmppc_h_remove ( struct kvm_vcpu * vcpu , unsigned long flags ,
unsigned long pte_index , unsigned long avpn ) ;
long kvmppc_h_bulk_remove ( struct kvm_vcpu * vcpu ) ;
long kvmppc_h_protect ( struct kvm_vcpu * vcpu , unsigned long flags ,
2021-04-12 04:48:40 +03:00
unsigned long pte_index , unsigned long avpn ) ;
2016-12-01 06:03:46 +03:00
long kvmppc_h_read ( struct kvm_vcpu * vcpu , unsigned long flags ,
unsigned long pte_index ) ;
long kvmppc_h_clear_ref ( struct kvm_vcpu * vcpu , unsigned long flags ,
unsigned long pte_index ) ;
long kvmppc_h_clear_mod ( struct kvm_vcpu * vcpu , unsigned long flags ,
unsigned long pte_index ) ;
2019-03-22 09:05:45 +03:00
long kvmppc_rm_h_page_init ( struct kvm_vcpu * vcpu , unsigned long flags ,
unsigned long dest , unsigned long src ) ;
2016-12-01 06:03:46 +03:00
long kvmppc_hpte_hv_fault ( struct kvm_vcpu * vcpu , unsigned long addr ,
unsigned long slb_v , unsigned int status , bool data ) ;
unsigned long kvmppc_rm_h_xirr ( struct kvm_vcpu * vcpu ) ;
2017-04-05 10:54:56 +03:00
unsigned long kvmppc_rm_h_xirr_x ( struct kvm_vcpu * vcpu ) ;
unsigned long kvmppc_rm_h_ipoll ( struct kvm_vcpu * vcpu , unsigned long server ) ;
2016-12-01 06:03:46 +03:00
int kvmppc_rm_h_ipi ( struct kvm_vcpu * vcpu , unsigned long server ,
unsigned long mfrr ) ;
int kvmppc_rm_h_cppr ( struct kvm_vcpu * vcpu , unsigned long cppr ) ;
int kvmppc_rm_h_eoi ( struct kvm_vcpu * vcpu , unsigned long xirr ) ;
2018-10-08 08:30:50 +03:00
void kvmppc_guest_entry_inject_int ( struct kvm_vcpu * vcpu ) ;
2016-12-01 06:03:46 +03:00
2015-12-17 23:59:06 +03:00
/*
 * Host-side operations we want to set up while running in real
 * mode in the guest operating on the xics.
 * Currently only VCPU wakeup is supported.
 */
union kvmppc_rm_state {
unsigned long raw ;
struct {
u32 in_host ;
u32 rm_action ;
} ;
} ;
struct kvmppc_host_rm_core {
union kvmppc_rm_state rm_state ;
void * rm_data ;
char pad [ 112 ] ;
} ;
struct kvmppc_host_rm_ops {
struct kvmppc_host_rm_core * rm_core ;
void ( * vcpu_kick ) ( struct kvm_vcpu * vcpu ) ;
} ;
extern struct kvmppc_host_rm_ops * kvmppc_host_rm_ops_hv ;
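The layout of the per-core real-mode state can be checked with a small user-space copy of the two types. This is only a sketch: the `_sketch` names and `rm_state_snapshot()` are hypothetical, and the sizes quoted assume a 64-bit build where `unsigned long` and pointers are 8 bytes. Under that assumption `raw` overlays `in_host` and `rm_action`, so both fields can be read with one word-sized load, and the 112-byte pad brings the structure to 128 bytes (8 + 8 + 112). That this is meant to give each core its own cache line is an assumption, not something the header states.

```c
#include <assert.h>
#include <stdint.h>

/* User-space copy of union kvmppc_rm_state, with u32 spelled uint32_t. */
union rm_state_sketch {
	unsigned long raw;
	struct {
		uint32_t in_host;
		uint32_t rm_action;
	};
};

/* User-space copy of struct kvmppc_host_rm_core. */
struct host_rm_core_sketch {
	union rm_state_sketch rm_state;
	void *rm_data;
	char pad[112];
};

/* Snapshot both fields through a single read of ->raw. */
static union rm_state_sketch rm_state_snapshot(const union rm_state_sketch *s)
{
	union rm_state_sketch copy;

	assert(s);
	copy.raw = s->raw;
	return copy;
}
```

A caller that needs a consistent view of `in_host` and `rm_action` together would work on such a snapshot rather than reading the two fields separately.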
2014-07-17 15:31:40 +04:00
static inline unsigned long kvmppc_get_epr(struct kvm_vcpu *vcpu)
{
# ifdef CONFIG_KVM_BOOKE_HV
	return mfspr(SPRN_GEPR);
# elif defined(CONFIG_BOOKE)
	return vcpu->arch.epr;
# else
	return 0;
# endif
}
2013-01-04 21:12:48 +04:00
static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
{
# ifdef CONFIG_KVM_BOOKE_HV
	mtspr(SPRN_GEPR, epr);
# elif defined(CONFIG_BOOKE)
	vcpu->arch.epr = epr;
# endif
}
2013-04-12 18:08:46 +04:00
# ifdef CONFIG_KVM_MPIC
void kvmppc_mpic_set_epr ( struct kvm_vcpu * vcpu ) ;
2013-04-12 18:08:47 +04:00
int kvmppc_mpic_connect_vcpu ( struct kvm_device * dev , struct kvm_vcpu * vcpu ,
u32 cpu ) ;
void kvmppc_mpic_disconnect_vcpu ( struct openpic * opp , struct kvm_vcpu * vcpu ) ;
2013-04-12 18:08:46 +04:00
# else
static inline void kvmppc_mpic_set_epr ( struct kvm_vcpu * vcpu )
{
}
2013-04-12 18:08:47 +04:00
static inline int kvmppc_mpic_connect_vcpu ( struct kvm_device * dev ,
struct kvm_vcpu * vcpu , u32 cpu )
{
return - EINVAL ;
}
static inline void kvmppc_mpic_disconnect_vcpu ( struct openpic * opp ,
struct kvm_vcpu * vcpu )
{
}
2013-04-12 18:08:46 +04:00
# endif /* CONFIG_KVM_MPIC */
2011-08-19 00:25:21 +04:00
int kvm_vcpu_ioctl_config_tlb ( struct kvm_vcpu * vcpu ,
struct kvm_config_tlb * cfg ) ;
int kvm_vcpu_ioctl_dirty_tlb ( struct kvm_vcpu * vcpu ,
struct kvm_dirty_tlb * cfg ) ;
2011-12-20 19:34:20 +04:00
long kvmppc_alloc_lpid ( void ) ;
void kvmppc_free_lpid ( long lpid ) ;
void kvmppc_init_lpid ( unsigned long nr_lpids ) ;
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given that the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 03:56:11 +03:00
static inline void kvmppc_mmu_flush_icache ( kvm_pfn_t pfn )
2012-08-03 15:56:33 +04:00
{
struct page * page ;
2013-04-25 10:33:57 +04:00
/*
 * We can only access pages that the kernel maps
 * as memory. Bail out for unmapped ones.
 */
if ( ! pfn_valid ( pfn ) )
return ;
/* Clear i-cache for new pages */
2012-08-03 15:56:33 +04:00
page = pfn_to_page ( pfn ) ;
2021-02-03 07:58:11 +03:00
if (!test_bit(PG_dcache_clean, &page->flags)) {
2012-08-03 15:56:33 +04:00
flush_dcache_icache_page ( page ) ;
2021-02-03 07:58:11 +03:00
set_bit(PG_dcache_clean, &page->flags);
2012-08-03 15:56:33 +04:00
}
}
2014-04-24 15:46:24 +04:00
/*
 * Shared struct helpers. The shared struct can be little or big endian,
 * depending on the guest endianness. So expose helpers to all of them.
 */
static inline bool kvmppc_shared_big_endian ( struct kvm_vcpu * vcpu )
{
# if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
/* Only Book3S_64 PR supports bi-endian for now */
return vcpu->arch.shared_big_endian;
# elif defined(CONFIG_PPC_BOOK3S_64) && defined(__LITTLE_ENDIAN__)
/* Book3s_64 HV on little endian is always little endian */
return false ;
# else
return true ;
# endif
}
2014-07-30 13:33:56 +04:00
# define SPRNG_WRAPPER_GET(reg, bookehv_spr) \
2014-07-17 15:31:35 +04:00
static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu) \
{ \
2014-07-30 13:33:56 +04:00
return mfspr ( bookehv_spr ) ; \
2014-07-17 15:31:35 +04:00
}
2014-07-30 13:33:56 +04:00
# define SPRNG_WRAPPER_SET(reg, bookehv_spr) \
2014-07-17 15:31:35 +04:00
static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val) \
{ \
2014-07-30 13:33:56 +04:00
mtspr ( bookehv_spr , val ) ; \
2014-07-17 15:31:35 +04:00
}
2014-04-24 15:46:24 +04:00
# define SHARED_WRAPPER_GET(reg, size) \
2014-07-17 15:31:35 +04:00
static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu) \
2014-04-24 15:46:24 +04:00
{ \
	if (kvmppc_shared_big_endian(vcpu)) \
		return be##size##_to_cpu(vcpu->arch.shared->reg); \
	else \
		return le##size##_to_cpu(vcpu->arch.shared->reg); \
}
# define SHARED_WRAPPER_SET(reg, size) \
static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val) \
{ \
	if (kvmppc_shared_big_endian(vcpu)) \
		vcpu->arch.shared->reg = cpu_to_be##size(val); \
	else \
		vcpu->arch.shared->reg = cpu_to_le##size(val); \
}
# define SHARED_WRAPPER(reg, size) \
	SHARED_WRAPPER_GET(reg, size) \
	SHARED_WRAPPER_SET(reg, size)
2014-07-30 13:33:56 +04:00
# define SPRNG_WRAPPER(reg, bookehv_spr) \
	SPRNG_WRAPPER_GET(reg, bookehv_spr) \
	SPRNG_WRAPPER_SET(reg, bookehv_spr)
2014-07-17 15:31:35 +04:00
# ifdef CONFIG_KVM_BOOKE_HV
2014-07-30 13:33:56 +04:00
# define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr) \
	SPRNG_WRAPPER(reg, bookehv_spr)
2014-07-17 15:31:35 +04:00
# else
2014-07-30 13:33:56 +04:00
# define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr) \
2014-07-17 15:31:35 +04:00
	SHARED_WRAPPER(reg, size)
# endif
2014-04-24 15:46:24 +04:00
SHARED_WRAPPER ( critical , 64 )
2014-07-17 15:31:35 +04:00
SHARED_SPRNG_WRAPPER ( sprg0 , 64 , SPRN_GSPRG0 )
SHARED_SPRNG_WRAPPER ( sprg1 , 64 , SPRN_GSPRG1 )
SHARED_SPRNG_WRAPPER ( sprg2 , 64 , SPRN_GSPRG2 )
SHARED_SPRNG_WRAPPER ( sprg3 , 64 , SPRN_GSPRG3 )
SHARED_SPRNG_WRAPPER ( srr0 , 64 , SPRN_GSRR0 )
SHARED_SPRNG_WRAPPER ( srr1 , 64 , SPRN_GSRR1 )
SHARED_SPRNG_WRAPPER ( dar , 64 , SPRN_GDEAR )
2014-07-17 15:31:38 +04:00
SHARED_SPRNG_WRAPPER ( esr , 64 , SPRN_GESR )
2014-04-24 15:46:24 +04:00
SHARED_WRAPPER_GET ( msr , 64 )
static inline void kvmppc_set_msr_fast(struct kvm_vcpu *vcpu, u64 val)
{
	if (kvmppc_shared_big_endian(vcpu))
		vcpu->arch.shared->msr = cpu_to_be64(val);
	else
		vcpu->arch.shared->msr = cpu_to_le64(val);
}
SHARED_WRAPPER ( dsisr , 32 )
SHARED_WRAPPER ( int_pending , 32 )
SHARED_WRAPPER ( sprg4 , 64 )
SHARED_WRAPPER ( sprg5 , 64 )
SHARED_WRAPPER ( sprg6 , 64 )
SHARED_WRAPPER ( sprg7 , 64 )
static inline u32 kvmppc_get_sr(struct kvm_vcpu *vcpu, int nr)
{
	if (kvmppc_shared_big_endian(vcpu))
		return be32_to_cpu(vcpu->arch.shared->sr[nr]);
	else
		return le32_to_cpu(vcpu->arch.shared->sr[nr]);
}
static inline void kvmppc_set_sr(struct kvm_vcpu *vcpu, int nr, u32 val)
{
	if (kvmppc_shared_big_endian(vcpu))
		vcpu->arch.shared->sr[nr] = cpu_to_be32(val);
	else
		vcpu->arch.shared->sr[nr] = cpu_to_le32(val);
}
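The endianness-aware accessor pattern used by the shared-struct wrappers can be tried outside the kernel with a sketch like the one below. Everything here is an assumption made for illustration: `struct shared_sketch`, the byte-array storage, the `load_*`/`store_*` helpers and the `SKETCH_WRAPPER32` macro are hypothetical and stand in for `vcpu->arch.shared`, `be32_to_cpu()`/`le32_to_cpu()` and `SHARED_WRAPPER(reg, 32)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of the shared area: registers kept as raw bytes. */
struct shared_sketch {
	bool big_endian;	/* guest byte order, like kvmppc_shared_big_endian() */
	uint8_t dsisr[4];
	uint8_t int_pending[4];
};

static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint32_t load_le32(const uint8_t *p)
{
	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

static void store_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static void store_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Generate a getter/setter pair per register, selecting byte order at run time. */
#define SKETCH_WRAPPER32(reg)						\
static uint32_t sketch_get_##reg(const struct shared_sketch *s)	\
{									\
	assert(s);							\
	return s->big_endian ? load_be32(s->reg) : load_le32(s->reg);	\
}									\
static void sketch_set_##reg(struct shared_sketch *s, uint32_t val)	\
{									\
	assert(s);							\
	if (s->big_endian)						\
		store_be32(s->reg, val);				\
	else								\
		store_le32(s->reg, val);				\
}

SKETCH_WRAPPER32(dsisr)
SKETCH_WRAPPER32(int_pending)
```

Because the byte order is looked up on every access, the same host code serves both big-endian and little-endian guests. Callers only ever see host-order values, while the bytes in the shared area stay in the order the guest expects.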
2013-07-11 02:47:39 +04:00
/*
 * Please call after prepare_to_enter. This function puts the lazy ee and irq
 * disabled tracking state back to normal mode, without actually enabling
 * interrupts.
 */
static inline void kvmppc_fix_ee_before_entry ( void )
2012-08-13 03:04:19 +04:00
{
2013-07-11 02:47:39 +04:00
trace_hardirqs_on ( ) ;
2012-08-13 03:04:19 +04:00
# ifdef CONFIG_PPC64
2014-01-10 05:18:40 +04:00
/*
 * To avoid races, the caller must have gone directly from having
 * interrupts fully-enabled to hard-disabled.
 */
WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
2012-08-13 03:04:19 +04:00
/* Only need to enable IRQs by hard enabling them after this */
local_paca->irq_happened = 0;
2017-12-20 06:55:50 +03:00
irq_soft_mask_set ( IRQS_ENABLED ) ;
2012-08-13 03:04:19 +04:00
# endif
}
2012-08-03 15:56:33 +04:00
2012-10-11 10:13:22 +04:00
static inline ulong kvmppc_get_ea_indexed ( struct kvm_vcpu * vcpu , int ra , int rb )
{
ulong ea ;
2012-10-11 10:13:23 +04:00
ulong msr_64bit = 0 ;
2012-10-11 10:13:22 +04:00
ea = kvmppc_get_gpr ( vcpu , rb ) ;
if ( ra )
ea += kvmppc_get_gpr(vcpu, ra);
2012-10-11 10:13:23 +04:00
# if defined(CONFIG_PPC_BOOK3E_64)
msr_64bit = MSR_CM ;
# elif defined(CONFIG_PPC_BOOK3S_64)
msr_64bit = MSR_SF ;
# endif
2014-04-24 15:46:24 +04:00
if ( ! ( kvmppc_get_msr ( vcpu ) & msr_64bit ) )
2012-10-11 10:13:23 +04:00
ea = ( uint32_t ) ea ;
2012-10-11 10:13:22 +04:00
return ea ;
}
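The effective-address computation in kvmppc_get_ea_indexed() can be exercised in user space with the sketch below. The names are hypothetical: `struct vcpu_sketch` replaces the real vcpu, and `SKETCH_MSR_64BIT` is an arbitrary stand-in bit for `MSR_SF` or `MSR_CM`, whose real values depend on the platform.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MSR_64BIT (1ULL << 63)	/* hypothetical stand-in for MSR_SF / MSR_CM */

/* Hypothetical miniature vcpu: general-purpose registers plus the MSR. */
struct vcpu_sketch {
	uint64_t gpr[32];
	uint64_t msr;
};

/* EA = (RA ? GPR[RA] : 0) + GPR[RB], truncated to 32 bits outside 64-bit mode. */
static uint64_t sketch_get_ea_indexed(const struct vcpu_sketch *v, int ra, int rb)
{
	uint64_t ea;

	assert(v && ra >= 0 && ra < 32 && rb >= 0 && rb < 32);
	ea = v->gpr[rb];
	if (ra)			/* RA = 0 means "no base register" */
		ea += v->gpr[ra];
	if (!(v->msr & SKETCH_MSR_64BIT))
		ea = (uint32_t)ea;	/* 32-bit mode: the address wraps at 4 GiB */
	return ea;
}
```

With GPR[1] = 0xFFFFFFFF and GPR[2] = 1, the sum is 0x100000000 in 64-bit mode and wraps to 0 when the 64-bit mode bit is clear.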
2013-04-18 00:30:50 +04:00
extern void xics_wake_cpu ( int cpu ) ;
2008-04-17 08:28:09 +04:00
# endif /* __POWERPC_KVM_PPC_H__ */