hyperv-fixes for v6.8

-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmXlZ0gTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXnnUB/oCfw6GxsYL4eAiKcayrU0E7aDZbZzG
 wf/1m3fSiocERGEQqyU7s3ULoba/ejX09nTwV+ZwECbqat64ceUQb5ousf/3kn7i
 vg3kbPKmF79c2DNMnT5+K7gvmhyewm+5r8eCBsOegEqnXv0F3tGjq729Qe+5/SBB
 roP5XHjERpY5yHVsDNsTeQ1Qg+H/Mg/2eLAogSFtY0FXKfNrXXmMAuKwe7UJdWmd
 KIeSA4F18wsohtb4Aq8XLDG8UwmCUaBjzGsBOgjlVLtP2QxyfxswWludVK/fwyVl
 T/xcMW2ZZcK7mWRebqr9iritxbOls8ltbsY3fLENREJShs+JgLs19w8x
 =vyy7
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Multiple fixes, cleanups and documentation updates for Hyper-V core code and
   drivers

* tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: make hv_bus const
  x86/hyperv: Allow 15-bit APIC IDs for VTL platforms
  x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad()
  x86/mm: Regularize set_memory_p() parameters and make non-static
  x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callback
  Documentation: hyperv: Add overview of PCI pass-thru device support
  Drivers: hv: vmbus: Update indentation in create_gpadl_header()
  Drivers: hv: vmbus: Remove duplication and cleanup code in create_gpadl_header()
  fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem()
  Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memory
  hv_utils: Allow implicit ICTIMESYNCFLAG_SYNC
Linus Torvalds 2024-03-05 12:38:50 -08:00
commit 1c46d04a0d
11 changed files with 517 additions and 122 deletions


@ -10,3 +10,4 @@ Hyper-V Enlightenments
overview
vmbus
clocks
vpci


@ -0,0 +1,316 @@
.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=====================

In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor. This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor. The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.
Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA). Public documentation for Hyper-V DDA is
available here: `DDA`_
.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment
DDA is typically used for storage controllers, such as NVMe,
and for GPUs. A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware. See Hyper-V
public documentation here: `SR-IOV`_
.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-
This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------

Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux. But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms. Consequently, vPCI devices on Hyper-V
have a dual identity. They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a bare-
metal system. Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare-metal. Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables. vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.
With this approach, the vPCI device is a VMBus device and a
PCI device at the same time. In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host. That
connection has a single VMBus channel. The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux. Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be interrupted
in the guest, or if the vPCI device is removed from
the VM while the VM is running. The ongoing operation of the
device happens directly between the Linux device driver for
the device and the hardware, with VMBus and the VMBus channel
playing no role.

PCI Device Setup
----------------

PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows. Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".
Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge. The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device. The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions. The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.
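For illustration, here is a minimal user-space sketch of that derivation,
assuming byte 5 is used as the high-order byte; the helper name and the
sample GUID bytes are made up, and the collision resolution performed by
hv_pci_probe() is omitted::

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: build a candidate PCI domain ID from bytes 4 and 5
     * of a VMBus instance GUID. The real driver additionally resolves
     * collisions, which this sketch does not attempt.
     */
    static uint16_t candidate_domain(const uint8_t instance_guid[16])
    {
        return (uint16_t)((instance_guid[5] << 8) | instance_guid[4]);
    }

    int main(void)
    {
        uint8_t guid[16] = { 0 };   /* arbitrary example; only bytes 4-5 matter */

        guid[4] = 0x34;
        guid[5] = 0x12;
        printf("candidate PCI domain: 0x%04x\n", candidate_domain(guid));
        return 0;
    }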
hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device. This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0. See
hv_pci_enter_d0(). When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.
hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs. That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.
Finally, hv_pci_probe() creates the root PCI bus. At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------

A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM. The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.
A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device. Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device. When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed. At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device. Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel. The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest. If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference. Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.
After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message. If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.
Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky. Ejection has been
observed even before a newly offered vPCI device has been
fully set up. The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------

The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces. For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU. Furthermore, hv_irq_unmask() is finally called
(on x86) or the GICD registers are set (on arm64) to specify
the real vCPU again. Each of these three calls interacts
with Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().
The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message. Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well. See comments in the code regarding this
very tricky area.
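The overall shape of that send-then-poll pattern is sketched below as a
stand-alone user-space program; the names are illustrative, and the VMBus
send and the rescind notification are reduced to stubs and comments::

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool reply_received;    /* set by the "interrupt" path */
    static atomic_bool device_rescinded;  /* set if the vPCI device goes away */

    /* Stand-in for the handler that runs when the host's reply interrupt fires. */
    static void reply_handler(void)
    {
        atomic_store(&reply_received, true);
    }

    /* Post the request, then spin (no sleeping) until the reply arrives or the
     * device is rescinded, mirroring the constraint described above.
     */
    static bool compose_msi_poll(void)
    {
        /* a real driver would post a VMBus message to the host here */
        reply_handler();   /* simulate the reply coming back immediately */

        while (!atomic_load(&reply_received)) {
            if (atomic_load(&device_rescinded))
                return false;   /* give up; the device is gone */
            /* a cpu_relax()-style pause would go here */
        }
        return true;
    }

    int main(void)
    {
        printf("composed: %s\n", compose_msi_poll() ? "ok" : "rescinded");
        return 0;
    }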
Most of the code in the Hyper-V virtual PCI driver (pci-
hyperv.c) applies to Hyper-V and Linux guests running on x86
and on arm64 architectures. But there are differences in how
interrupt assignments are managed. On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt. This
hypercall is made by hv_arch_irq_unmask(). On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86. Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.
The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs. If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly. However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break. Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---

By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory. Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers. The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.
Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA. When running on x86, this behavior is
required by the architecture. When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT. But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a VMBus
device and as a PCI device). See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------

As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest. Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used. The version is negotiated when communication over
the VMBus channel is first established. See
hv_pci_protocol_negotiation(). Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.
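One plausible negotiation strategy is sketched below as a stand-alone
program: offer versions from newest to oldest and keep the first one the
host accepts. The version numbers and the host stub are invented for
illustration and are not the actual vPCI protocol constants::

    #include <stdbool.h>
    #include <stdio.h>

    static const unsigned int offered[] = { 0x00010004, 0x00010003, 0x00010002 };

    /* Pretend host: accepts anything up to the "1.3" value above. */
    static bool host_accepts(unsigned int version)
    {
        return version <= 0x00010003;
    }

    int main(void)
    {
        for (size_t i = 0; i < sizeof(offered) / sizeof(offered[0]); i++) {
            if (host_accepts(offered[i])) {
                printf("negotiated version 0x%x\n", offered[i]);
                return 0;
            }
        }
        printf("no common protocol version\n");
        return 1;
    }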

Guest NUMA node affinity
------------------------

When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node(). If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0. But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options. If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.

PCI config space access in a CoCo VM
------------------------------------

Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver. In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.

Config Block back-channel
-------------------------

The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest. The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel. As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
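As a rough illustration of how another kernel driver might call this
interface, the fragment below is a sketch only: the block ID and the buffer
size are arbitrary example values, and the prototypes should be checked
against include/linux/hyperv.h rather than taken from here::

    #include <linux/hyperv.h>
    #include <linux/pci.h>

    #define EXAMPLE_DIAG_BLOCK_ID  1   /* made-up block identifier */

    /* Sketch: write a small diagnostic blob to the host, then read one back. */
    static int example_cfg_blk_exchange(struct pci_dev *pdev)
    {
        u8 buf[128] = { 0 };
        unsigned int bytes_returned = 0;
        int ret;

        ret = hyperv_write_cfg_blk(pdev, buf, sizeof(buf), EXAMPLE_DIAG_BLOCK_ID);
        if (ret)
            return ret;

        return hyperv_read_cfg_blk(pdev, buf, sizeof(buf), EXAMPLE_DIAG_BLOCK_ID,
                                   &bytes_returned);
    }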


@ -16,6 +16,11 @@
extern struct boot_params boot_params;
static struct real_mode_header hv_vtl_real_mode_header;
static bool __init hv_vtl_msi_ext_dest_id(void)
{
return true;
}
void __init hv_vtl_init_platform(void)
{
pr_info("Linux runs in Hyper-V Virtual Trust Level\n");
@ -38,6 +43,8 @@ void __init hv_vtl_init_platform(void)
x86_platform.legacy.warm_reset = 0;
x86_platform.legacy.reserve_bios_regions = 0;
x86_platform.legacy.devices.pnpbios = 0;
x86_init.hyper.msi_ext_dest_id = hv_vtl_msi_ext_dest_id;
}
static inline u64 hv_vtl_system_desc_base(struct ldttss_desc *desc)


@ -15,6 +15,7 @@
#include <asm/io.h>
#include <asm/coco.h>
#include <asm/mem_encrypt.h>
#include <asm/set_memory.h>
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
#include <asm/mtrr.h>
@ -502,6 +503,31 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
return -EFAULT;
}
/*
* When transitioning memory between encrypted and decrypted, the caller
* of set_memory_encrypted() or set_memory_decrypted() is responsible for
* ensuring that the memory isn't in use and isn't referenced while the
* transition is in progress. The transition has multiple steps, and the
* memory is in an inconsistent state until all steps are complete. A
* reference while the state is inconsistent could result in an exception
* that can't be cleanly fixed up.
*
* But the Linux kernel load_unaligned_zeropad() mechanism could cause a
* stray reference that can't be prevented by the caller, so Linux has
* specific code to handle this case. But when the #VC and #VE exceptions
* are routed to a paravisor, the specific code doesn't work. To avoid this
* problem, mark the pages as "not present" while the transition is in
* progress. If load_unaligned_zeropad() causes a stray reference, a normal
* page fault is generated instead of #VC or #VE, and the page-fault-based
* handlers for load_unaligned_zeropad() resolve the reference. When the
* transition is complete, hv_vtom_set_host_visibility() marks the pages
* as "present" again.
*/
static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
{
return !set_memory_np(kbuffer, pagecount);
}
/*
* hv_vtom_set_host_visibility - Set specified memory visible to host.
*
@ -515,16 +541,28 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
enum hv_mem_host_visibility visibility = enc ?
VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
u64 *pfn_array;
phys_addr_t paddr;
void *vaddr;
int ret = 0;
bool result = true;
int i, pfn;
pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
if (!pfn_array)
return false;
if (!pfn_array) {
result = false;
goto err_set_memory_p;
}
for (i = 0, pfn = 0; i < pagecount; i++) {
pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
/*
* Use slow_virt_to_phys() because the PRESENT bit has been
* temporarily cleared in the PTEs. slow_virt_to_phys() works
* without the PRESENT bit while virt_to_hvpfn() or similar
* does not.
*/
vaddr = (void *)kbuffer + (i * HV_HYP_PAGE_SIZE);
paddr = slow_virt_to_phys(vaddr);
pfn_array[pfn] = paddr >> HV_HYP_PAGE_SHIFT;
pfn++;
if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
@ -538,14 +576,30 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
}
}
err_free_pfn_array:
kfree(pfn_array);
err_set_memory_p:
/*
* Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
* did. Do this even if there is an error earlier in this function in
* order to avoid leaving the memory range in a "broken" state. Setting
* the PRESENT bits shouldn't fail, but return an error if it does.
*/
if (set_memory_p(kbuffer, pagecount))
result = false;
return result;
}
static bool hv_vtom_tlb_flush_required(bool private)
{
return true;
/*
* Since hv_vtom_clear_present() marks the PTEs as "not present"
* and flushes the TLB, they can't be in the TLB. That makes the
* flush controlled by this function redundant, so return "false".
*/
return false;
}
static bool hv_vtom_cache_flush_required(void)
@ -608,6 +662,7 @@ void __init hv_vtom_init(void)
x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
/* Set WB as the default cache mode. */


@ -47,6 +47,7 @@ int set_memory_uc(unsigned long addr, int numpages);
int set_memory_wc(unsigned long addr, int numpages);
int set_memory_wb(unsigned long addr, int numpages);
int set_memory_np(unsigned long addr, int numpages);
int set_memory_p(unsigned long addr, int numpages);
int set_memory_4k(unsigned long addr, int numpages);
int set_memory_encrypted(unsigned long addr, int numpages);
int set_memory_decrypted(unsigned long addr, int numpages);


@ -755,10 +755,14 @@ pmd_t *lookup_pmd_address(unsigned long address)
* areas on 32-bit NUMA systems. The percpu areas can
* end up in this kind of memory, for instance.
*
* This could be optimized, but it is only intended to be
* used at initialization time, and keeping it
* unoptimized should increase the testing coverage for
* the more obscure platforms.
* Note that as long as the PTEs are well-formed with correct PFNs, this
* works without checking the PRESENT bit in the leaf PTE. This is unlike
* the similar vmalloc_to_page() and derivatives. Callers may depend on
* this behavior.
*
* This could be optimized, but it is only used in paths that are not perf
* sensitive, and keeping it unoptimized should increase the testing coverage
* for the more obscure platforms.
*/
phys_addr_t slow_virt_to_phys(void *__virt_addr)
{
@ -2041,17 +2045,12 @@ int set_mce_nospec(unsigned long pfn)
return rc;
}
static int set_memory_p(unsigned long *addr, int numpages)
{
return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
}
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
return set_memory_p(&addr, 1);
return set_memory_p(addr, 1);
}
EXPORT_SYMBOL_GPL(clear_mce_nospec);
#endif /* CONFIG_X86_64 */
@ -2104,6 +2103,11 @@ int set_memory_np_noalias(unsigned long addr, int numpages)
CPA_NO_CHECK_ALIAS, NULL);
}
int set_memory_p(unsigned long addr, int numpages)
{
return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_PRESENT), 0);
}
int set_memory_4k(unsigned long addr, int numpages)
{
return change_page_attr_set_clr(&addr, numpages, __pgprot(0),


@ -322,125 +322,89 @@ static int create_gpadl_header(enum hv_gpadl_type type, void *kbuffer,
pagecount = hv_gpadl_size(type, size) >> HV_HYP_PAGE_SHIFT;
/* do we need a gpadl body msg */
pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
sizeof(struct vmbus_channel_gpadl_header) -
sizeof(struct gpa_range);
pfncount = umin(pagecount, pfnsize / sizeof(u64));
msgsize = sizeof(struct vmbus_channel_msginfo) +
sizeof(struct vmbus_channel_gpadl_header) +
sizeof(struct gpa_range) + pfncount * sizeof(u64);
msgheader = kzalloc(msgsize, GFP_KERNEL);
if (!msgheader)
return -ENOMEM;
INIT_LIST_HEAD(&msgheader->submsglist);
msgheader->msgsize = msgsize;
gpadl_header = (struct vmbus_channel_gpadl_header *)
msgheader->msg;
gpadl_header->rangecount = 1;
gpadl_header->range_buflen = sizeof(struct gpa_range) +
pagecount * sizeof(u64);
gpadl_header->range[0].byte_offset = 0;
gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
for (i = 0; i < pfncount; i++)
gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
type, kbuffer, size, send_offset, i);
*msginfo = msgheader;
pfnsum = pfncount;
pfnleft = pagecount - pfncount;
/* how many pfns can we fit in a body message */
pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
sizeof(struct vmbus_channel_gpadl_body);
pfncount = pfnsize / sizeof(u64);
if (pagecount > pfncount) {
/* we need a gpadl body */
/* fill in the header */
/*
* If pfnleft is zero, everything fits in the header and no body
* messages are needed
*/
while (pfnleft) {
pfncurr = umin(pfncount, pfnleft);
msgsize = sizeof(struct vmbus_channel_msginfo) +
sizeof(struct vmbus_channel_gpadl_header) +
sizeof(struct gpa_range) + pfncount * sizeof(u64);
msgheader = kzalloc(msgsize, GFP_KERNEL);
if (!msgheader)
goto nomem;
INIT_LIST_HEAD(&msgheader->submsglist);
msgheader->msgsize = msgsize;
gpadl_header = (struct vmbus_channel_gpadl_header *)
msgheader->msg;
gpadl_header->rangecount = 1;
gpadl_header->range_buflen = sizeof(struct gpa_range) +
pagecount * sizeof(u64);
gpadl_header->range[0].byte_offset = 0;
gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
for (i = 0; i < pfncount; i++)
gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
type, kbuffer, size, send_offset, i);
*msginfo = msgheader;
pfnsum = pfncount;
pfnleft = pagecount - pfncount;
/* how many pfns can we fit */
pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
sizeof(struct vmbus_channel_gpadl_body);
pfncount = pfnsize / sizeof(u64);
/* fill in the body */
while (pfnleft) {
if (pfnleft > pfncount)
pfncurr = pfncount;
else
pfncurr = pfnleft;
msgsize = sizeof(struct vmbus_channel_msginfo) +
sizeof(struct vmbus_channel_gpadl_body) +
pfncurr * sizeof(u64);
msgbody = kzalloc(msgsize, GFP_KERNEL);
if (!msgbody) {
struct vmbus_channel_msginfo *pos = NULL;
struct vmbus_channel_msginfo *tmp = NULL;
/*
* Free up all the allocated messages.
*/
list_for_each_entry_safe(pos, tmp,
&msgheader->submsglist,
msglistentry) {
list_del(&pos->msglistentry);
kfree(pos);
}
goto nomem;
}
msgbody->msgsize = msgsize;
gpadl_body =
(struct vmbus_channel_gpadl_body *)msgbody->msg;
sizeof(struct vmbus_channel_gpadl_body) +
pfncurr * sizeof(u64);
msgbody = kzalloc(msgsize, GFP_KERNEL);
if (!msgbody) {
struct vmbus_channel_msginfo *pos = NULL;
struct vmbus_channel_msginfo *tmp = NULL;
/*
* Gpadl is u32 and we are using a pointer which could
* be 64-bit
* This is governed by the guest/host protocol and
* so the hypervisor guarantees that this is ok.
* Free up all the allocated messages.
*/
for (i = 0; i < pfncurr; i++)
gpadl_body->pfn[i] = hv_gpadl_hvpfn(type,
kbuffer, size, send_offset, pfnsum + i);
list_for_each_entry_safe(pos, tmp,
&msgheader->submsglist,
msglistentry) {
/* add to msg header */
list_add_tail(&msgbody->msglistentry,
&msgheader->submsglist);
pfnsum += pfncurr;
pfnleft -= pfncurr;
list_del(&pos->msglistentry);
kfree(pos);
}
kfree(msgheader);
return -ENOMEM;
}
} else {
/* everything fits in a header */
msgsize = sizeof(struct vmbus_channel_msginfo) +
sizeof(struct vmbus_channel_gpadl_header) +
sizeof(struct gpa_range) + pagecount * sizeof(u64);
msgheader = kzalloc(msgsize, GFP_KERNEL);
if (msgheader == NULL)
goto nomem;
INIT_LIST_HEAD(&msgheader->submsglist);
msgheader->msgsize = msgsize;
msgbody->msgsize = msgsize;
gpadl_body = (struct vmbus_channel_gpadl_body *)msgbody->msg;
gpadl_header = (struct vmbus_channel_gpadl_header *)
msgheader->msg;
gpadl_header->rangecount = 1;
gpadl_header->range_buflen = sizeof(struct gpa_range) +
pagecount * sizeof(u64);
gpadl_header->range[0].byte_offset = 0;
gpadl_header->range[0].byte_count = hv_gpadl_size(type, size);
for (i = 0; i < pagecount; i++)
gpadl_header->range[0].pfn_array[i] = hv_gpadl_hvpfn(
type, kbuffer, size, send_offset, i);
/*
* Gpadl is u32 and we are using a pointer which could
* be 64-bit
* This is governed by the guest/host protocol and
* so the hypervisor guarantees that this is ok.
*/
for (i = 0; i < pfncurr; i++)
gpadl_body->pfn[i] = hv_gpadl_hvpfn(type,
kbuffer, size, send_offset, pfnsum + i);
*msginfo = msgheader;
/* add to msg header */
list_add_tail(&msgbody->msglistentry, &msgheader->submsglist);
pfnsum += pfncurr;
pfnleft -= pfncurr;
}
return 0;
nomem:
kfree(msgheader);
kfree(msgbody);
return -ENOMEM;
}
/*


@ -296,6 +296,11 @@ static struct {
spinlock_t lock;
} host_ts;
static bool timesync_implicit;
module_param(timesync_implicit, bool, 0644);
MODULE_PARM_DESC(timesync_implicit, "If set treat SAMPLE as SYNC when clock is behind");
static inline u64 reftime_to_ns(u64 reftime)
{
return (reftime - WLTIMEDELTA) * 100;
@ -344,6 +349,29 @@ static void hv_set_host_time(struct work_struct *work)
do_settimeofday64(&ts);
}
/*
* Due to a bug on Hyper-V hosts, the sync flag may not always be sent on resume.
* Force a sync if the guest is behind.
*/
static inline bool hv_implicit_sync(u64 host_time)
{
struct timespec64 new_ts;
struct timespec64 threshold_ts;
new_ts = ns_to_timespec64(reftime_to_ns(host_time));
ktime_get_real_ts64(&threshold_ts);
threshold_ts.tv_sec += 5;
/*
* If the guest is behind the host by 5 or more seconds.
*/
if (timespec64_compare(&new_ts, &threshold_ts) >= 0)
return true;
return false;
}
/*
* Synchronize time with host after reboot, restore, etc.
*
@ -384,7 +412,8 @@ static inline void adj_guesttime(u64 hosttime, u64 reftime, u8 adj_flags)
spin_unlock_irqrestore(&host_ts.lock, flags);
/* Schedule work to do do_settimeofday64() */
if (adj_flags & ICTIMESYNCFLAG_SYNC)
if ((adj_flags & ICTIMESYNCFLAG_SYNC) ||
(timesync_implicit && hv_implicit_sync(host_ts.host_time)))
schedule_work(&adj_time_work);
}


@ -988,7 +988,7 @@ static const struct dev_pm_ops vmbus_pm = {
};
/* The one and only one */
static struct bus_type hv_bus = {
static const struct bus_type hv_bus = {
.name = "vmbus",
.match = vmbus_match,
.shutdown = vmbus_shutdown,


@ -1010,8 +1010,6 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
goto getmem_done;
}
pr_info("Unable to allocate enough contiguous physical memory on Gen 1 VM. Using MMIO instead.\n");
} else {
goto err1;
}
/*


@ -164,8 +164,28 @@ struct hv_ring_buffer {
u8 buffer[];
} __packed;
/*
* If the requested ring buffer size is at least 8 times the size of the
* header, steal space from the ring buffer for the header. Otherwise, add
* space for the header so that it doesn't take too much of the ring buffer
* space.
*
* The factor of 8 is somewhat arbitrary. The goal is to prevent adding a
* relatively small header (4 Kbytes on x86) to a large-ish power-of-2 ring
* buffer size (such as 128 Kbytes) and so end up making a nearly twice as
* large allocation that will be almost half wasted. As a contrasting example,
* on ARM64 with 64 Kbyte page size, we don't want to take 64 Kbytes for the
* header from a 128 Kbyte allocation, leaving only 64 Kbytes for the ring.
* In this latter case, we must add 64 Kbytes for the header and not worry
* about what's wasted.
*/
#define VMBUS_HEADER_ADJ(payload_sz) \
((payload_sz) >= 8 * sizeof(struct hv_ring_buffer) ? \
0 : sizeof(struct hv_ring_buffer))
/* Calculate the proper size of a ringbuffer, it must be page-aligned */
#define VMBUS_RING_SIZE(payload_sz) PAGE_ALIGN(sizeof(struct hv_ring_buffer) + \
#define VMBUS_RING_SIZE(payload_sz) PAGE_ALIGN(VMBUS_HEADER_ADJ(payload_sz) + \
(payload_sz))
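/*
 * Worked example (a sketch, not part of this header): assuming 4 KiB pages
 * and a 4 KiB struct hv_ring_buffer header, as on x86, the stand-alone
 * program below mirrors the macro arithmetic to show when the header is
 * stolen from the payload and when it is added on top.
 */
#include <stdio.h>

#define EX_PAGE_SZ   4096UL    /* assumed page size (x86) */
#define EX_HDR_SZ    4096UL    /* assumed header size (x86) */
#define EX_ALIGN(x)  (((x) + EX_PAGE_SZ - 1) & ~(EX_PAGE_SZ - 1))

#define EX_HDR_ADJ(payload)   ((payload) >= 8 * EX_HDR_SZ ? 0 : EX_HDR_SZ)
#define EX_RING_SIZE(payload) EX_ALIGN(EX_HDR_ADJ(payload) + (payload))

int main(void)
{
	/* 128 KiB >= 8 * 4 KiB: the header is stolen, total stays 131072 bytes */
	printf("128 KiB payload -> %lu bytes\n", EX_RING_SIZE(128 * 1024UL));

	/* 16 KiB < 8 * 4 KiB: the header is added on top, total is 20480 bytes */
	printf("16 KiB payload  -> %lu bytes\n", EX_RING_SIZE(16 * 1024UL));
	return 0;
}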
struct hv_ring_buffer_info {