hyperv-fixes for v6.8

-----BEGIN PGP SIGNATURE-----

iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmXlZ0gTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXnnUB/oCfw6GxsYL4eAiKcayrU0E7aDZbZzG
wf/1m3fSiocERGEQqyU7s3ULoba/ejX09nTwV+ZwECbqat64ceUQb5ousf/3kn7i
vg3kbPKmF79c2DNMnT5+K7gvmhyewm+5r8eCBsOegEqnXv0F3tGjq729Qe+5/SBB
roP5XHjERpY5yHVsDNsTeQ1Qg+H/Mg/2eLAogSFtY0FXKfNrXXmMAuKwe7UJdWmd
KIeSA4F18wsohtb4Aq8XLDG8UwmCUaBjzGsBOgjlVLtP2QxyfxswWludVK/fwyVl
T/xcMW2ZZcK7mWRebqr9iritxbOls8ltbsY3fLENREJShs+JgLs19w8x
=vyy7
-----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Multiple fixes, cleanups and documentation for Hyper-V core code and
   drivers

* tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: make hv_bus const
  x86/hyperv: Allow 15-bit APIC IDs for VTL platforms
  x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad()
  x86/mm: Regularize set_memory_p() parameters and make non-static
  x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callback
  Documentation: hyperv: Add overview of PCI pass-thru device support
  Drivers: hv: vmbus: Update indentation in create_gpadl_header()
  Drivers: hv: vmbus: Remove duplication and cleanup code in create_gpadl_header()
  fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem()
  Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memory
  hv_utils: Allow implicit ICTIMESYNCFLAG_SYNC
Documentation/virt/hyperv/index.rst

@@ -10,3 +10,4 @@ Hyper-V Enlightenments
    overview
    vmbus
    clocks
+   vpci

Documentation/virt/hyperv/vpci.rst (new file, 316 lines)

@@ -0,0 +1,316 @@
.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=====================

In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor. This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor. The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA). Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs. A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware. See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.
Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux. But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms. Consequently, vPCI devices on Hyper-V
have a dual identity. They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system. Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal. Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables. vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time. In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host. That
connection has a single VMBus channel. The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux. Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed
from the VM while the VM is running. The ongoing operation of
the device happens directly between the Linux device driver
for the device and the hardware, with VMBus and the VMBus
channel playing no role.
PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows. Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge. The PCI domain ID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device. The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions. The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domain IDs don't
change, as the domain ID appears in the user space
configuration of some devices.
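As an illustration only, the initial (pre-collision-resolution)
domain ID candidate can be sketched as below; the real logic,
including the collision handling, lives in hv_pci_probe() in
drivers/pci/controller/pci-hyperv.c::

	/* Sketch: candidate PCI domain ID taken from bytes 4 and 5 of
	 * the VMBus instance GUID. Resolution of collisions with domain
	 * IDs already in use (not shown) happens afterward.
	 */
	static u16 candidate_pci_domain(const u8 instance_guid[16])
	{
		return (u16)instance_guid[5] << 8 | instance_guid[4];
	}
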
hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device. This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter D0. See
hv_pci_enter_d0(). When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs. That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus. At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.
PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM. The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device. Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device. When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed. At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device. Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel. The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest. If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference. Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message. If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky. Ejection has been
observed even before a newly offered vPCI device has been
fully set up. The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.
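In outline, the eject handling described above looks roughly like
the following (an illustrative sketch; apart from
pci_stop_and_remove_bus_device(), the names are made up, and the
real code in drivers/pci/controller/pci-hyperv.c is far more
careful about locking and races)::

	void send_ejection_complete(struct hv_pci_dev *hpdev);	/* hypothetical */

	/* Sketch of the eject sequence for one vPCI device */
	static void eject_device(struct hv_pci_dev *hpdev)
	{
		struct pci_dev *pdev = hpdev->pdev;	/* hypothetical field */

		pci_stop_and_remove_bus_device(pdev);	/* drop PCI identity */
		send_ejection_complete(hpdev);		/* hypothetical reply */
		/* Host then sends a VMBus rescind, removing the VMBus identity */
	}
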
Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces. For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU. Finally, hv_irq_unmask() is called (on x86), or the
GICD registers are set (on arm64), to specify the real vCPU
again. Each of these three calls interacts with Hyper-V, which
must decide which physical CPU should receive the interrupt
before it is forwarded to the guest VM. Unfortunately, the
Hyper-V decision-making process is a bit limited, and can
result in concentrating the physical interrupts on a single
CPU, causing a performance bottleneck. See details about how
this is resolved in the extensive comment above the function
hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message. Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead, hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well. See comments in the code regarding this
very tricky area.
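The resulting structure can be pictured as follows (a simplified
sketch with made-up types and helpers; see hv_compose_msi_msg()
for the real code, which also handles timeouts and lock details)::

	struct vpci_channel    { bool rescinded; };	/* illustrative */
	struct vpci_completion { bool done; };		/* illustrative */

	int vpci_send_request(struct vpci_channel *chan);	/* hypothetical */
	void vpci_poll_for_reply(struct vpci_channel *chan);	/* hypothetical */

	/* Sketch: send a request, then busy-poll for the reply, because
	 * sleeping is not allowed when IRQ locks are held. The rescind
	 * check covers a device that is ejected while we wait.
	 */
	static int send_and_poll(struct vpci_channel *chan,
				 struct vpci_completion *comp)
	{
		int ret = vpci_send_request(chan);

		if (ret)
			return ret;

		while (!READ_ONCE(comp->done)) {
			if (READ_ONCE(chan->rescinded))
				return -ENODEV;	/* device vanished mid-wait */
			vpci_poll_for_reply(chan);
		}
		return 0;
	}
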
Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on
x86 and on arm64 architectures. But there are differences in
how interrupt assignments are managed. On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt. This
hypercall is made by hv_arch_irq_unmask(). On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86. Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs. If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly. However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break. Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.
DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory. Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers. The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA. When running on x86, this behavior is
required by the architecture. When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT. But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device). See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.
vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest. Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used. The version is negotiated when communication over
the VMBus channel is first established. See
hv_pci_protocol_negotiation(). Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.
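Negotiation tries the newest version the guest driver supports
first, then falls back, roughly as sketched below (the version
macros exist in pci-hyperv.c; the helper shown here is made up)::

	/* Protocol versions the guest driver knows, newest first */
	static const u32 pci_protocol_versions[] = {
		PCI_PROTOCOL_VERSION_1_4,
		PCI_PROTOCOL_VERSION_1_3,
		PCI_PROTOCOL_VERSION_1_2,
		PCI_PROTOCOL_VERSION_1_1,
	};

	int try_version(struct hv_pcibus_device *hbus, u32 version);	/* hypothetical */

	static int negotiate(struct hv_pcibus_device *hbus)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(pci_protocol_versions); i++) {
			/* hypothetical helper wrapping the VMBus exchange */
			if (try_version(hbus, pci_protocol_versions[i]) == 0)
				return 0;	/* host accepted this version */
		}
		return -EPROTO;		/* no mutually supported version */
	}
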
Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node(). If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0. But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options. If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.
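A driver can consume the assigned node through the standard NUMA
accessors, for example (a minimal sketch using standard kernel
APIs)::

	#include <linux/slab.h>

	/* Allocate per-device memory near the vPCI device, using the
	 * node stored by hv_pci_assign_numa_node(). Remember that node 0
	 * may also mean "no information was available".
	 */
	static void *alloc_near_device(struct pci_dev *pdev, size_t size)
	{
		return kmalloc_node(size, GFP_KERNEL, dev_to_node(&pdev->dev));
	}
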
PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver. In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.
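Conceptually, the read path chooses between the two mechanisms
like this (a rough sketch; hv_hypercall_mmio_read() is a made-up
name for the hypercall-based helper)::

	/* Sketch: config space read in normal vs. CoCo VMs */
	static u32 cfg_read32(void __iomem *cfg_addr, phys_addr_t cfg_pa,
			      int where)
	{
		if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			/* CoCo VM: explicit hypercall, since Hyper-V
			 * cannot emulate the MMIO access
			 */
			return hv_hypercall_mmio_read(cfg_pa + where);

		/* Normal VM: direct access, trapped and handled by Hyper-V */
		return readl(cfg_addr + where);
	}
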
Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest. The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel. As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
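A consumer calls the interface roughly as follows (a sketch; the
block ID shown is made up, and the signature is assumed to match
the declaration in include/linux/hyperv.h)::

	#define EXAMPLE_BLOCK_ID 1	/* hypothetical; IDs are device-defined */

	/* Read one back-channel config block for a vPCI device */
	static int read_diag_block(struct pci_dev *pdev, void *buf,
				   unsigned int len)
	{
		unsigned int bytes_returned;
		int ret;

		ret = hyperv_read_cfg_blk(pdev, buf, len, EXAMPLE_BLOCK_ID,
					  &bytes_returned);

		return ret ? ret : bytes_returned;
	}
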
arch/x86/hyperv/hv_vtl.c

@@ -16,6 +16,11 @@
 extern struct boot_params boot_params;
 static struct real_mode_header hv_vtl_real_mode_header;
 
+static bool __init hv_vtl_msi_ext_dest_id(void)
+{
+	return true;
+}
+
 void __init hv_vtl_init_platform(void)
 {
 	pr_info("Linux runs in Hyper-V Virtual Trust Level\n");
@@ -38,6 +43,8 @@ void __init hv_vtl_init_platform(void)
 	x86_platform.legacy.warm_reset = 0;
 	x86_platform.legacy.reserve_bios_regions = 0;
 	x86_platform.legacy.devices.pnpbios = 0;
+
+	x86_init.hyper.msi_ext_dest_id = hv_vtl_msi_ext_dest_id;
 }
 
 static inline u64 hv_vtl_system_desc_base(struct ldttss_desc *desc)
arch/x86/hyperv/ivm.c

@@ -15,6 +15,7 @@
 #include <asm/io.h>
 #include <asm/coco.h>
 #include <asm/mem_encrypt.h>
+#include <asm/set_memory.h>
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mtrr.h>
@@ -502,6 +503,31 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	return -EFAULT;
 }
 
+/*
+ * When transitioning memory between encrypted and decrypted, the caller
+ * of set_memory_encrypted() or set_memory_decrypted() is responsible for
+ * ensuring that the memory isn't in use and isn't referenced while the
+ * transition is in progress. The transition has multiple steps, and the
+ * memory is in an inconsistent state until all steps are complete. A
+ * reference while the state is inconsistent could result in an exception
+ * that can't be cleanly fixed up.
+ *
+ * But the Linux kernel load_unaligned_zeropad() mechanism could cause a
+ * stray reference that can't be prevented by the caller, so Linux has
+ * specific code to handle this case. But when the #VC and #VE exceptions
+ * are routed to a paravisor, the specific code doesn't work. To avoid this
+ * problem, mark the pages as "not present" while the transition is in
+ * progress. If load_unaligned_zeropad() causes a stray reference, a normal
+ * page fault is generated instead of #VC or #VE, and the page-fault-based
+ * handlers for load_unaligned_zeropad() resolve the reference. When the
+ * transition is complete, hv_vtom_set_host_visibility() marks the pages
+ * as "present" again.
+ */
+static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
+{
+	return !set_memory_np(kbuffer, pagecount);
+}
+
 /*
  * hv_vtom_set_host_visibility - Set specified memory visible to host.
  *
@@ -515,16 +541,28 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
 	enum hv_mem_host_visibility visibility = enc ?
 			VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
 	u64 *pfn_array;
+	phys_addr_t paddr;
+	void *vaddr;
 	int ret = 0;
+	bool result = true;
 	int i, pfn;
 
 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
-	if (!pfn_array)
-		return false;
+	if (!pfn_array) {
+		result = false;
+		goto err_set_memory_p;
+	}
 
 	for (i = 0, pfn = 0; i < pagecount; i++) {
-		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
+		/*
+		 * Use slow_virt_to_phys() because the PRESENT bit has been
+		 * temporarily cleared in the PTEs. slow_virt_to_phys() works
+		 * without the PRESENT bit while virt_to_hvpfn() or similar
+		 * does not.
+		 */
+		vaddr = (void *)kbuffer + (i * HV_HYP_PAGE_SIZE);
+		paddr = slow_virt_to_phys(vaddr);
+		pfn_array[pfn] = paddr >> HV_HYP_PAGE_SHIFT;
 		pfn++;
 
 		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
@@ -538,14 +576,30 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
 		}
 	}
 
 err_free_pfn_array:
 	kfree(pfn_array);
+
+err_set_memory_p:
+	/*
+	 * Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
+	 * did. Do this even if there is an error earlier in this function in
+	 * order to avoid leaving the memory range in a "broken" state. Setting
+	 * the PRESENT bits shouldn't fail, but return an error if it does.
+	 */
+	if (set_memory_p(kbuffer, pagecount))
+		result = false;
+
+	return result;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
 {
-	return true;
+	/*
+	 * Since hv_vtom_clear_present() marks the PTEs as "not present"
+	 * and flushes the TLB, they can't be in the TLB. That makes the
+	 * flush controlled by this function redundant, so return "false".
+	 */
+	return false;
 }
 
 static bool hv_vtom_cache_flush_required(void)
@@ -608,6 +662,7 @@ void __init hv_vtom_init(void)
 	x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
 	x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
+	x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
 	x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
 
 	/* Set WB as the default cache mode. */
arch/x86/include/asm/set_memory.h

@@ -47,6 +47,7 @@ int set_memory_uc(unsigned long addr, int numpages);
 int set_memory_wc(unsigned long addr, int numpages);
 int set_memory_wb(unsigned long addr, int numpages);
 int set_memory_np(unsigned long addr, int numpages);
+int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
arch/x86/mm/pat/set_memory.c

@@ -755,10 +755,14 @@ pmd_t *lookup_pmd_address(unsigned long address)
  * areas on 32-bit NUMA systems. The percpu areas can
  * end up in this kind of memory, for instance.
  *
- * This could be optimized, but it is only intended to be
- * used at initialization time, and keeping it
- * unoptimized should increase the testing coverage for
- * the more obscure platforms.
+ * Note that as long as the PTEs are well-formed with correct PFNs, this
+ * works without checking the PRESENT bit in the leaf PTE. This is unlike
+ * the similar vmalloc_to_page() and derivatives. Callers may depend on
+ * this behavior.
+ *
+ * This could be optimized, but it is only used in paths that are not perf
+ * sensitive, and keeping it unoptimized should increase the testing coverage
+ * for the more obscure platforms.
  */
 phys_addr_t slow_virt_to_phys(void *__virt_addr)
 {
@@ -2041,17 +2045,12 @@ int set_mce_nospec(unsigned long pfn)
 	return rc;
 }
 
-static int set_memory_p(unsigned long *addr, int numpages)
-{
-	return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
-}
-
 /* Restore full speculative operation to the pfn. */
 int clear_mce_nospec(unsigned long pfn)
 {
 	unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
 
-	return set_memory_p(&addr, 1);
+	return set_memory_p(addr, 1);
 }
 EXPORT_SYMBOL_GPL(clear_mce_nospec);
 #endif /* CONFIG_X86_64 */
@@ -2104,6 +2103,11 @@ int set_memory_np_noalias(unsigned long addr, int numpages)
 					CPA_NO_CHECK_ALIAS, NULL);
 }
 
+int set_memory_p(unsigned long addr, int numpages)
+{
+	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
 int set_memory_4k(unsigned long addr, int numpages)
 {
 	return change_page_attr_set_clr(&addr, numpages, __pgprot(0),
drivers/hv/channel.c

@@ -322,125 +322,89 @@ static int create_gpadl_header(enum hv_gpadl_type type, void *kbuffer,
 
 	pagecount = hv_gpadl_size(type, size) >> HV_HYP_PAGE_SHIFT;
 
-	/* do we need a gpadl body msg */
 	pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
 		  sizeof(struct vmbus_channel_gpadl_header) -
 		  sizeof(struct gpa_range);
+	pfncount = umin(pagecount, pfnsize / sizeof(u64));
+
+	msgsize = sizeof(struct vmbus_channel_msginfo) +
+		  sizeof(struct vmbus_channel_gpadl_header) +
+		  sizeof(struct gpa_range) + pfncount * sizeof(u64);
+	msgheader = kzalloc(msgsize, GFP_KERNEL);
+	if (!msgheader)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&msgheader->submsglist);
+	msgheader->msgsize = msgsize;
+
+	gpadl_header = (struct vmbus_channel_gpadl_header *)
+		msgheader->msg;
+	gpadl_header->rangecount = 1;
+	gpadl_header->range_buflen = sizeof(struct gpa_range) +
+				 pagecount * sizeof(u64);
+	gpadl_header->range[0].byte_offset = 0;
+	gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
+	for (i = 0; i < pfncount; i++)
+		gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
+			type, kbuffer, size, send_offset, i);
+	*msginfo = msgheader;
+
+	pfnsum = pfncount;
+	pfnleft = pagecount - pfncount;
+
+	/* how many pfns can we fit in a body message */
+	pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
+		  sizeof(struct vmbus_channel_gpadl_body);
 	pfncount = pfnsize / sizeof(u64);
 
-	if (pagecount > pfncount) {
-		/* we need a gpadl body */
-		/* fill in the header */
-		msgsize = sizeof(struct vmbus_channel_msginfo) +
-			  sizeof(struct vmbus_channel_gpadl_header) +
-			  sizeof(struct gpa_range) + pfncount * sizeof(u64);
-		msgheader = kzalloc(msgsize, GFP_KERNEL);
-		if (!msgheader)
-			goto nomem;
-
-		INIT_LIST_HEAD(&msgheader->submsglist);
-		msgheader->msgsize = msgsize;
-
-		gpadl_header = (struct vmbus_channel_gpadl_header *)
-			msgheader->msg;
-		gpadl_header->rangecount = 1;
-		gpadl_header->range_buflen = sizeof(struct gpa_range) +
-					 pagecount * sizeof(u64);
-		gpadl_header->range[0].byte_offset = 0;
-		gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
-		for (i = 0; i < pfncount; i++)
-			gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
-				type, kbuffer, size, send_offset, i);
-		*msginfo = msgheader;
-
-		pfnsum = pfncount;
-		pfnleft = pagecount - pfncount;
-
-		/* how many pfns can we fit */
-		pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
-			  sizeof(struct vmbus_channel_gpadl_body);
-		pfncount = pfnsize / sizeof(u64);
-
-		/* fill in the body */
-		while (pfnleft) {
-			if (pfnleft > pfncount)
-				pfncurr = pfncount;
-			else
-				pfncurr = pfnleft;
-
-			msgsize = sizeof(struct vmbus_channel_msginfo) +
-				  sizeof(struct vmbus_channel_gpadl_body) +
-				  pfncurr * sizeof(u64);
-			msgbody = kzalloc(msgsize, GFP_KERNEL);
-
-			if (!msgbody) {
-				struct vmbus_channel_msginfo *pos = NULL;
-				struct vmbus_channel_msginfo *tmp = NULL;
-				/*
-				 * Free up all the allocated messages.
-				 */
-				list_for_each_entry_safe(pos, tmp,
-					&msgheader->submsglist,
-					msglistentry) {
-
-					list_del(&pos->msglistentry);
-					kfree(pos);
-				}
-
-				goto nomem;
-			}
-
-			msgbody->msgsize = msgsize;
-			gpadl_body =
-				(struct vmbus_channel_gpadl_body *)msgbody->msg;
+	/*
+	 * If pfnleft is zero, everything fits in the header and no body
+	 * messages are needed
+	 */
+	while (pfnleft) {
+		pfncurr = umin(pfncount, pfnleft);
+		msgsize = sizeof(struct vmbus_channel_msginfo) +
+			  sizeof(struct vmbus_channel_gpadl_body) +
+			  pfncurr * sizeof(u64);
+		msgbody = kzalloc(msgsize, GFP_KERNEL);
+
+		if (!msgbody) {
+			struct vmbus_channel_msginfo *pos = NULL;
+			struct vmbus_channel_msginfo *tmp = NULL;
+			/*
+			 * Free up all the allocated messages.
+			 */
+			list_for_each_entry_safe(pos, tmp,
+				&msgheader->submsglist,
+				msglistentry) {
 
-			/*
-			 * Gpadl is u32 and we are using a pointer which could
-			 * be 64-bit
-			 * This is governed by the guest/host protocol and
-			 * so the hypervisor guarantees that this is ok.
-			 */
-			for (i = 0; i < pfncurr; i++)
-				gpadl_body->pfn[i] = hv_gpadl_hvpfn(type,
-					kbuffer, size, send_offset, pfnsum + i);
-
-			/* add to msg header */
-			list_add_tail(&msgbody->msglistentry,
-				      &msgheader->submsglist);
-			pfnsum += pfncurr;
-			pfnleft -= pfncurr;
+				list_del(&pos->msglistentry);
+				kfree(pos);
+			}
+			kfree(msgheader);
+			return -ENOMEM;
 		}
-	} else {
-		/* everything fits in a header */
-		msgsize = sizeof(struct vmbus_channel_msginfo) +
-			  sizeof(struct vmbus_channel_gpadl_header) +
-			  sizeof(struct gpa_range) + pagecount * sizeof(u64);
-		msgheader = kzalloc(msgsize, GFP_KERNEL);
-		if (msgheader == NULL)
-			goto nomem;
-
-		INIT_LIST_HEAD(&msgheader->submsglist);
-		msgheader->msgsize = msgsize;
-
-		gpadl_header = (struct vmbus_channel_gpadl_header *)
-			msgheader->msg;
-		gpadl_header->rangecount = 1;
-		gpadl_header->range_buflen = sizeof(struct gpa_range) +
-					 pagecount * sizeof(u64);
-		gpadl_header->range[0].byte_offset = 0;
-		gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
-		for (i = 0; i < pagecount; i++)
-			gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
-				type, kbuffer, size, send_offset, i);
-
-		*msginfo = msgheader;
+
+		msgbody->msgsize = msgsize;
+		gpadl_body = (struct vmbus_channel_gpadl_body *)msgbody->msg;
+
+		/*
+		 * Gpadl is u32 and we are using a pointer which could
+		 * be 64-bit
+		 * This is governed by the guest/host protocol and
+		 * so the hypervisor guarantees that this is ok.
+		 */
+		for (i = 0; i < pfncurr; i++)
+			gpadl_body->pfn[i] = hv_gpadl_hvpfn(type,
+				kbuffer, size, send_offset, pfnsum + i);
+
+		/* add to msg header */
+		list_add_tail(&msgbody->msglistentry, &msgheader->submsglist);
+		pfnsum += pfncurr;
+		pfnleft -= pfncurr;
 	}
 
 	return 0;
-nomem:
-	kfree(msgheader);
-	kfree(msgbody);
-	return -ENOMEM;
 }
 
 /*
drivers/hv/hv_util.c

@@ -296,6 +296,11 @@ static struct {
 	spinlock_t			lock;
 } host_ts;
 
+static bool timesync_implicit;
+
+module_param(timesync_implicit, bool, 0644);
+MODULE_PARM_DESC(timesync_implicit, "If set treat SAMPLE as SYNC when clock is behind");
+
 static inline u64 reftime_to_ns(u64 reftime)
 {
 	return (reftime - WLTIMEDELTA) * 100;
@@ -344,6 +349,29 @@ static void hv_set_host_time(struct work_struct *work)
 		do_settimeofday64(&ts);
 }
 
+/*
+ * Due to a bug on Hyper-V hosts, the sync flag may not always be sent on resume.
+ * Force a sync if the guest is behind.
+ */
+static inline bool hv_implicit_sync(u64 host_time)
+{
+	struct timespec64 new_ts;
+	struct timespec64 threshold_ts;
+
+	new_ts = ns_to_timespec64(reftime_to_ns(host_time));
+	ktime_get_real_ts64(&threshold_ts);
+
+	threshold_ts.tv_sec += 5;
+
+	/*
+	 * If the guest is behind the host by 5 or more seconds.
+	 */
+	if (timespec64_compare(&new_ts, &threshold_ts) >= 0)
+		return true;
+
+	return false;
+}
+
 /*
  * Synchronize time with host after reboot, restore, etc.
  *
@@ -384,7 +412,8 @@ static inline void adj_guesttime(u64 hosttime, u64 reftime, u8 adj_flags)
 	spin_unlock_irqrestore(&host_ts.lock, flags);
 
 	/* Schedule work to do do_settimeofday64() */
-	if (adj_flags & ICTIMESYNCFLAG_SYNC)
+	if ((adj_flags & ICTIMESYNCFLAG_SYNC) ||
+	    (timesync_implicit && hv_implicit_sync(host_ts.host_time)))
 		schedule_work(&adj_time_work);
 }
drivers/hv/vmbus_drv.c

@@ -988,7 +988,7 @@ static const struct dev_pm_ops vmbus_pm = {
 };
 
 /* The one and only one */
-static struct bus_type  hv_bus = {
+static const struct bus_type  hv_bus = {
 	.name =		"vmbus",
 	.match =	vmbus_match,
 	.shutdown =	vmbus_shutdown,
drivers/video/fbdev/hyperv_fb.c

@@ -1010,8 +1010,6 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 			goto getmem_done;
 		}
 		pr_info("Unable to allocate enough contiguous physical memory on Gen 1 VM. Using MMIO instead.\n");
-	} else {
-		goto err1;
 	}
 
 	/*
include/linux/hyperv.h

@@ -164,8 +164,28 @@ struct hv_ring_buffer {
 	u8 buffer[];
 } __packed;
 
+/*
+ * If the requested ring buffer size is at least 8 times the size of the
+ * header, steal space from the ring buffer for the header. Otherwise, add
+ * space for the header so that it doesn't take too much of the ring buffer
+ * space.
+ *
+ * The factor of 8 is somewhat arbitrary. The goal is to prevent adding a
+ * relatively small header (4 Kbytes on x86) to a large-ish power-of-2 ring
+ * buffer size (such as 128 Kbytes) and so end up making a nearly twice as
+ * large allocation that will be almost half wasted. As a contrasting example,
+ * on ARM64 with 64 Kbyte page size, we don't want to take 64 Kbytes for the
+ * header from a 128 Kbyte allocation, leaving only 64 Kbytes for the ring.
+ * In this latter case, we must add 64 Kbytes for the header and not worry
+ * about what's wasted.
+ */
+#define VMBUS_HEADER_ADJ(payload_sz) \
+	((payload_sz) >= 8 * sizeof(struct hv_ring_buffer) ? \
+	0 : sizeof(struct hv_ring_buffer))
+
 /* Calculate the proper size of a ringbuffer, it must be page-aligned */
-#define VMBUS_RING_SIZE(payload_sz) PAGE_ALIGN(sizeof(struct hv_ring_buffer) + \
+#define VMBUS_RING_SIZE(payload_sz) PAGE_ALIGN(VMBUS_HEADER_ADJ(payload_sz) + \
 					       (payload_sz))
 
 struct hv_ring_buffer_info {
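As a worked example of the new macro (assuming a 4 Kbyte struct
hv_ring_buffer header and 4 Kbyte pages, as on x86; this example is
not part of the patch):

	/*
	 * VMBUS_RING_SIZE(128K): 128K >= 8 * 4K, so the header steals space
	 *   from the payload -> PAGE_ALIGN(0 + 128K) = 128K allocated.
	 * VMBUS_RING_SIZE(16K):  16K < 8 * 4K, so the header is added on
	 *   top -> PAGE_ALIGN(4K + 16K) = 20K allocated.
	 * The old macro always added the header, turning a 128K request
	 * into 132K, which a power-of-2 backing allocation makes nearly
	 * twice as large as needed.
	 */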