583cda1b0e
Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU device ioctl. If the VCPU is scheduled on a physical CPU which has a different PMU, the perf events needed to emulate a guest PMU won't be scheduled in and the guest performance counters will stop counting. Treat it as an userspace error and refuse to run the VCPU in this situation. Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220127161759.53553-7-alexandru.elisei@arm.com
266 lines
10 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0

======================
Generic vcpu interface
======================

The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct
kvm_device_attr as other devices, but targets VCPU-wide settings and controls.

The groups and attributes per virtual cpu, if any, are architecture specific.

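As an illustration of how the three ioctls share struct kvm_device_attr, the
following is a minimal sketch of a userspace wrapper. The helper names
``vcpu_device_attr`` and ``vcpu_has_attr`` are hypothetical and not part of any
KVM API; the sketch assumes ``vcpu_fd`` is a file descriptor obtained from
KVM_CREATE_VCPU.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: issue one of the device-attr ioctls on a VCPU fd.
 * 'request' is KVM_SET_DEVICE_ATTR, KVM_GET_DEVICE_ATTR or
 * KVM_HAS_DEVICE_ATTR. Returns 0 on success or -errno on failure.
 */
int vcpu_device_attr(int vcpu_fd, unsigned long request,
                     uint32_t group, uint64_t attr, void *data)
{
    struct kvm_device_attr da;

    memset(&da, 0, sizeof(da));
    da.group = group;                     /* e.g. an architecture-specific group */
    da.attr = attr;                       /* attribute within that group */
    da.addr = (uint64_t)(uintptr_t)data;  /* userspace pointer, may be NULL */

    if (ioctl(vcpu_fd, request, &da) < 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: probe whether a VCPU supports an attribute. */
int vcpu_has_attr(int vcpu_fd, uint32_t group, uint64_t attr)
{
    return vcpu_device_attr(vcpu_fd, KVM_HAS_DEVICE_ATTR,
                            group, attr, NULL) == 0;
}
```

Probing with KVM_HAS_DEVICE_ATTR before a set or get lets userspace tell an
unsupported attribute apart from a failure caused by a bad argument.
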
1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL
==================================

:Architectures: ARM64

1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ
---------------------------------------

:Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
             pointer to an int

Returns:

    =======  ========================================================
    -EBUSY   The PMU overflow interrupt is already set
    -EFAULT  Error reading interrupt number
    -ENXIO   PMUv3 not supported or the overflow interrupt not set
             when attempting to get it
    -ENODEV  KVM_ARM_VCPU_PMU_V3 feature missing from VCPU
    -EINVAL  Invalid PMU overflow interrupt number supplied or
             trying to set the IRQ number without using an in-kernel
             irqchip.
    =======  ========================================================

A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
type must be the same for each vcpu. As a PPI, the interrupt number is the same
for all vcpus, while as an SPI it must be a separate number per vcpu.

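The PPI/SPI rule can be checked in userspace before any ioctl is issued. The
following is a sketch only: the function name ``pmu_irqs_consistent`` is
hypothetical, and the INTID ranges (PPIs 16..31, SPIs 32..1019) are taken from
the GIC architecture rather than from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* GIC INTID ranges: PPIs are 16..31, SPIs are 32..1019. */
static bool intid_is_ppi(int intid) { return intid >= 16 && intid < 32; }
static bool intid_is_spi(int intid) { return intid >= 32 && intid < 1020; }

/*
 * Hypothetical check: irq[i] is the overflow interrupt planned for VCPU i.
 * All VCPUs must use the same interrupt type; PPIs must share one number,
 * SPIs must all be distinct.
 */
bool pmu_irqs_consistent(const int *irq, size_t nr_vcpus)
{
    size_t i, j;

    assert(irq != NULL || nr_vcpus == 0);
    if (nr_vcpus == 0)
        return true;

    if (intid_is_ppi(irq[0])) {
        for (i = 1; i < nr_vcpus; i++)
            if (irq[i] != irq[0])
                return false;
        return true;
    }

    for (i = 0; i < nr_vcpus; i++) {
        if (!intid_is_spi(irq[i]))
            return false;
        for (j = 0; j < i; j++)
            if (irq[j] == irq[i])
                return false;
    }
    return true;
}
```

A check like this only mirrors the rule stated above; KVM remains the
authority and may still reject a configuration for other reasons.
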
1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
---------------------------------------

:Parameters: no additional parameter in kvm_device_attr.addr

Returns:

    =======  ======================================================
    -EEXIST  Interrupt number already used
    -ENODEV  PMUv3 not supported or GIC not initialized
    -ENXIO   PMUv3 not supported, missing VCPU feature or interrupt
             number not set
    -EBUSY   PMUv3 already initialized
    =======  ======================================================

Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
virtual GIC implementation, this must be done after initializing the in-kernel
irqchip.

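The ordering implied by the error codes (interrupt number first, then
initialization) might look like the following sketch. The function name
``pmu_v3_setup`` is hypothetical. The ``#ifndef`` fallbacks are an assumption
added so that the sketch builds on non-arm64 hosts; on arm64 the real
definitions come from the UAPI headers. The sketch also assumes the in-kernel
irqchip has already been initialized.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fallbacks mirroring the arm64 UAPI values, for non-arm64 build hosts. */
#ifndef KVM_ARM_VCPU_PMU_V3_CTRL
#define KVM_ARM_VCPU_PMU_V3_CTRL 0
#endif
#ifndef KVM_ARM_VCPU_PMU_V3_IRQ
#define KVM_ARM_VCPU_PMU_V3_IRQ 0
#endif
#ifndef KVM_ARM_VCPU_PMU_V3_INIT
#define KVM_ARM_VCPU_PMU_V3_INIT 1
#endif

/*
 * Hypothetical helper: set the overflow interrupt, then initialize the PMU.
 * Returns 0 on success or -errno from the first failing ioctl.
 */
int pmu_v3_setup(int vcpu_fd, int irq)
{
    struct kvm_device_attr da;

    memset(&da, 0, sizeof(da));
    da.group = KVM_ARM_VCPU_PMU_V3_CTRL;

    /* Step 1: the interrupt number is passed as a pointer to an int. */
    da.attr = KVM_ARM_VCPU_PMU_V3_IRQ;
    da.addr = (uint64_t)(uintptr_t)&irq;
    if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &da) < 0)
        return -errno;

    /* Step 2: INIT takes no additional parameter in addr. */
    da.attr = KVM_ARM_VCPU_PMU_V3_INIT;
    da.addr = 0;
    if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &da) < 0)
        return -errno;

    return 0;
}
```

Returning the negated errno lets the caller match the result directly against
the error tables for the two attributes.
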
1.3 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FILTER
-----------------------------------------

:Parameters: in kvm_device_attr.addr the address for a PMU event filter is a
             pointer to a struct kvm_pmu_event_filter

:Returns:

    =======  ======================================================
    -ENODEV  PMUv3 not supported or GIC not initialized
    -ENXIO   PMUv3 not properly configured or in-kernel irqchip not
             configured as required prior to calling this attribute
    -EBUSY   PMUv3 already initialized or a VCPU has already run
    -EINVAL  Invalid filter range
    =======  ======================================================

Request the installation of a PMU event filter described as follows::

    struct kvm_pmu_event_filter {
            __u16   base_event;
            __u16   nevents;

    #define KVM_PMU_EVENT_ALLOW     0
    #define KVM_PMU_EVENT_DENY      1

            __u8    action;
            __u8    pad[3];
    };

A filter range is defined as the range [@base_event, @base_event + @nevents),
together with an @action (KVM_PMU_EVENT_ALLOW or KVM_PMU_EVENT_DENY). The
first registered range defines the global policy (global ALLOW if the first
@action is DENY, global DENY if the first @action is ALLOW). Multiple ranges
can be programmed, and must fit within the event space defined by the PMU
architecture (10 bits on ARMv8.0, 16 bits from ARMv8.1 onwards).

Note: "Cancelling" a filter by registering the opposite action for the same
range doesn't change the default action. For example, installing an ALLOW
filter for event range [0:10) as the first filter and then applying a DENY
action for the same range will leave the whole range as disabled.

Restrictions: Event 0 (SW_INCR) is never filtered, as it doesn't count a
hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
isn't strictly speaking an event. Filtering the cycle counter is possible
using event 0x11 (CPU_CYCLES).

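The filter semantics can be modelled in a few lines of userspace C. This is a
sketch of the behaviour described above, not KVM's implementation: the names
``filter_model``, ``filter_add`` and ``filter_event_allowed`` are hypothetical,
and a 16-bit event space is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EVT_ALLOW 0             /* mirrors KVM_PMU_EVENT_ALLOW */
#define EVT_DENY  1             /* mirrors KVM_PMU_EVENT_DENY */
#define NR_EVENTS (1u << 16)    /* assumed 16-bit event space */

/* Hypothetical model of the per-VM filter state. */
struct filter_model {
    bool installed;              /* has a first range been registered? */
    uint8_t allowed[NR_EVENTS];  /* 1 = the guest may count this event */
};

/* Register the range [base_event, base_event + nevents) with 'action'. */
int filter_add(struct filter_model *m, uint16_t base_event,
               uint16_t nevents, uint8_t action)
{
    uint32_t end = (uint32_t)base_event + nevents;
    uint32_t e;

    if (end > NR_EVENTS || (action != EVT_ALLOW && action != EVT_DENY))
        return -EINVAL;

    if (!m->installed) {
        /* The first range sets the global policy to the opposite action. */
        memset(m->allowed, action == EVT_DENY ? 1 : 0, sizeof(m->allowed));
        m->installed = true;
    }
    for (e = base_event; e < end; e++)
        m->allowed[e] = (action == EVT_ALLOW);
    return 0;
}

bool filter_event_allowed(const struct filter_model *m, uint16_t event)
{
    /* SW_INCR (0x00) and CHAIN (0x1E) are unaffected by filtering. */
    if (event == 0x00 || event == 0x1E)
        return true;
    if (!m->installed)
        return true;            /* no filter registered: nothing is denied */
    return m->allowed[event] != 0;
}
```

With this model, registering ALLOW for [0:10) and then DENY for the same range
leaves events 1 to 9 denied and every event outside the range denied as well,
because the global policy was fixed to DENY by the first registration.
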
1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
------------------------------------------

:Parameters: in kvm_device_attr.addr the address of an int representing the PMU
             identifier.

:Returns:

    =======  ====================================================
    -EBUSY   PMUv3 already initialized, a VCPU has already run or
             an event filter has already been set
    -EFAULT  Error accessing the PMU identifier
    -ENXIO   PMU not found
    -ENODEV  PMUv3 not supported or GIC not initialized
    -ENOMEM  Could not allocate memory
    =======  ====================================================

Request that the VCPU uses the specified hardware PMU when creating guest events
for the purpose of PMU emulation. The PMU identifier can be read from the "type"
file for the desired PMU instance under /sys/devices (or, equivalently,
/sys/bus/event_source). This attribute is particularly useful on heterogeneous
systems where there are at least two CPU PMUs on the system. The PMU that is set
for one VCPU will be used by all the other VCPUs. It isn't possible to set a PMU
if a PMU event filter is already present.

Note that KVM will not make any attempts to run the VCPU on the physical CPUs
associated with the PMU specified by this attribute. This is entirely left to
userspace. However, attempting to run the VCPU on a physical CPU not supported
by the PMU will fail and KVM_RUN will return with
exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct by setting
the hardware_entry_failure_reason field to KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED
and the cpu field to the processor id.

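Reading the PMU identifier from sysfs could be done as in the following sketch.
The function name ``read_pmu_type`` is hypothetical, and the caller is assumed
to supply the full path of the "type" file for the PMU instance it wants.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the integer stored in a PMU "type" file.
 * Returns 0 and stores the identifier in *pmu_id, or a negative errno.
 */
int read_pmu_type(const char *path, int *pmu_id)
{
    FILE *f;
    int parsed;

    assert(path != NULL && pmu_id != NULL);

    f = fopen(path, "r");
    if (!f)
        return -errno;

    /* The file holds a single decimal integer followed by a newline. */
    parsed = (fscanf(f, "%d", pmu_id) == 1);
    fclose(f);

    return parsed ? 0 : -EINVAL;
}
```

The resulting int is what kvm_device_attr.addr points to when setting this
attribute. Since KVM does not place the VCPU on a matching physical CPU,
userspace would typically also restrict the VCPU thread's affinity to the CPUs
that the chosen PMU covers.
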
2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
=================================

:Architectures: ARM, ARM64

2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER
-----------------------------------------------------------------------------

:Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
             pointer to an int

Returns:

    =======  ==================================
    -EINVAL  Invalid timer interrupt number
    -EBUSY   One or more VCPUs have already run
    =======  ==================================

A value describing the architected timer interrupt number when connected to an
in-kernel virtual GIC. It must be a PPI (16 <= intid < 32). Setting the
attribute overrides the default values (see below).

============================= ==========================================
KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27)
KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30)
============================= ==========================================

Setting the same PPI for different timers will prevent the VCPUs from running.
Setting the interrupt number on a VCPU configures all VCPUs created at that
time to use the number provided for a given timer, overwriting any previously
configured values on other VCPUs. Userspace should configure the interrupt
numbers on at least one VCPU after creating all VCPUs and before running any
VCPUs.

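The two constraints on timer interrupts (each must be a PPI, and the timers
must not share one) can be validated before the attributes are set. This is a
sketch with hypothetical names; only the PPI range and the defaults come from
the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_VTIMER_INTID 27  /* default for KVM_ARM_VCPU_TIMER_IRQ_VTIMER */
#define DEFAULT_PTIMER_INTID 30  /* default for KVM_ARM_VCPU_TIMER_IRQ_PTIMER */

/* A PPI satisfies 16 <= intid < 32. */
static bool timer_intid_is_ppi(int intid)
{
    return intid >= 16 && intid < 32;
}

/*
 * Hypothetical check: both timers need a PPI, and using the same PPI for
 * both would prevent the VCPUs from running.
 */
bool timer_irqs_valid(int vtimer_intid, int ptimer_intid)
{
    return timer_intid_is_ppi(vtimer_intid) &&
           timer_intid_is_ppi(ptimer_intid) &&
           vtimer_intid != ptimer_intid;
}
```

Because setting the number on one VCPU applies to all VCPUs that exist at that
time, a VMM would run this check once, after creating every VCPU and before
running any of them.
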
3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
==================================

:Architectures: ARM64

3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA
--------------------------------------

:Parameters: 64-bit base address

Returns:

    =======  ======================================
    -ENXIO   Stolen time not implemented
    -EEXIST  Base address already set for this VCPU
    -EINVAL  Base address not 64 byte aligned
    =======  ======================================

Specifies the base address of the stolen time structure for this VCPU. The
base address must be 64 byte aligned and exist within a valid guest memory
region. See Documentation/virt/kvm/arm/pvtime.rst for more information
including the layout of the stolen time structure.

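A VMM has to pick one base address per VCPU that meets the alignment rule. The
sketch below shows one possible layout; giving each VCPU its own consecutive
64-byte slot inside a reserved guest region is an assumption of this example,
not a requirement stated here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PVTIME_ALIGN 64u  /* required alignment of the base address, in bytes */

/* True if 'ipa' satisfies the 64 byte alignment requirement. */
bool pvtime_ipa_aligned(uint64_t ipa)
{
    return (ipa & (PVTIME_ALIGN - 1)) == 0;
}

/*
 * Hypothetical layout: VCPU n gets the n-th 64-byte slot of a reserved,
 * aligned region starting at 'region_base'.
 */
uint64_t pvtime_ipa_for_vcpu(uint64_t region_base, unsigned int vcpu_index)
{
    assert(pvtime_ipa_aligned(region_base));
    return region_base + (uint64_t)vcpu_index * PVTIME_ALIGN;
}
```

Checking alignment up front avoids an -EINVAL from the ioctl; whether the
address lies within a valid guest memory region still has to be ensured by the
VMM's own memory map.
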
4. GROUP: KVM_VCPU_TSC_CTRL
===========================

:Architectures: x86

4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
----------------------------------

:Parameters: 64-bit unsigned TSC offset

Returns:

    =======  ======================================
    -EFAULT  Error reading/writing the provided
             parameter address.
    -ENXIO   Attribute not supported
    =======  ======================================

Specifies the guest's TSC offset relative to the host's TSC. The guest's
TSC is then derived by the following equation:

  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET

This attribute is useful to adjust the guest's TSC on live migration,
so that the TSC counts the time during which the VM was paused. The
following describes a possible algorithm to use for this purpose.

From the source VMM process:

1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
   kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
   (host_src).

2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
   guest TSC offset (ofs_src[i]).

3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
   guest's TSC (freq).

From the destination VMM process:

4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
   kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
   fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
   structure.

   KVM will advance the VM's kvmclock to account for elapsed time since
   recording the clock values. Note that this will cause problems in
   the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
   between the source and destination, and a reasonably short time passes
   between the source pausing the VMs and the destination executing
   steps 4-7.

5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
   kvmclock nanoseconds (guest_dest).

6. Adjust the guest TSC offsets for every vCPU to account for (1) time
   elapsed since recording state and (2) difference in TSCs between the
   source and destination machine:

       ofs_dst[i] = ofs_src[i] -
           (guest_src - guest_dest) * freq +
           (tsc_src - tsc_dest)

   ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
   a time of 0 in kvmclock. The above formula ensures that it is the
   same on the destination as it was on the source).

7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
   respective value derived in the previous step.

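The arithmetic of step 6 can be sketched as follows. The function names are
hypothetical. The formula in step 6 leaves units implicit; this sketch assumes
the kvmclock values are in nanoseconds and that freq is the kHz value returned
by KVM_GET_TSC_KHZ, so the product is divided by 1000000 to obtain TSC cycles.
It relies on the ``__int128`` extension of GCC and Clang to avoid overflow in
the intermediate product.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a signed nanosecond delta to TSC cycles at 'freq_khz'. */
int64_t ns_to_tsc_cycles(int64_t delta_ns, uint32_t freq_khz)
{
    /* cycles = ns * (freq_khz * 1000 Hz) / 1e9 = ns * freq_khz / 1e6 */
    return (int64_t)(((__int128)delta_ns * freq_khz) / 1000000);
}

/*
 * Step 6 for one vCPU:
 *   ofs_dst = ofs_src - (guest_src - guest_dest) * freq + (tsc_src - tsc_dest)
 */
uint64_t tsc_offset_dest(uint64_t ofs_src,
                         uint64_t guest_src, uint64_t guest_dest,
                         uint64_t tsc_src, uint64_t tsc_dest,
                         uint32_t freq_khz)
{
    int64_t delta_ns = (int64_t)(guest_src - guest_dest);
    uint64_t delta_cycles;

    assert(freq_khz > 0);
    delta_cycles = (uint64_t)ns_to_tsc_cycles(delta_ns, freq_khz);

    /* Unsigned arithmetic wraps modulo 2^64, like the 64-bit offset itself. */
    return ofs_src - delta_cycles + (tsc_src - tsc_dest);
}
```

As a sanity check on the formula, when one second of kvmclock time elapses
between the two KVM_GET_CLOCK calls on a 2 GHz guest TSC, the guest TSC
computed on the destination (tsc_dest + ofs_dst) is exactly 2000000000 cycles
ahead of the value recorded on the source (tsc_src + ofs_src).
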