It should be possible to select vfio-pci variant drivers without building
vfio-pci itself, which implies each variant driver should select
vfio-pci-core.
Fix the top-level vfio Makefile to traverse the pci subdirectory based on
vfio-pci-core rather than vfio-pci.
Mark the MMAP and INTx options as depending on vfio-pci-core to clean up the
resulting config if core is not enabled.
Push all PCI-related vfio options into a sub-menu and make the descriptions
consistent.
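A minimal Kconfig sketch of the resulting pattern; VFIO_PCI_FOO is a
hypothetical variant driver standing in for a real one:

  config VFIO_PCI_FOO
      tristate "Example vfio-pci variant driver"
      select VFIO_PCI_CORE
      help
        Selecting VFIO_PCI_CORE directly lets this driver build
        without VFIO_PCI itself being enabled.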
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20230614193948.477036-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Test and enable PCIe AtomicOp completer support of various widths and
report it to userspace via the device-info capability.
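A hedged sketch of probing completer support from the Device Capabilities 2
register; the VFIO_PCI_ATOMIC_COMP* flag names are assumed from the uapi this
change adds, and the surrounding capability reporting is omitted:

  static u32 probe_atomic_comp(struct pci_dev *pdev)
  {
      u32 cap = 0, flags = 0;

      /* DEVCAP2 advertises which completer widths the device supports */
      pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP2, &cap);
      if (cap & PCI_EXP_DEVCAP2_ATOMIC_COMP32)
          flags |= VFIO_PCI_ATOMIC_COMP32;
      if (cap & PCI_EXP_DEVCAP2_ATOMIC_COMP64)
          flags |= VFIO_PCI_ATOMIC_COMP64;
      if (cap & PCI_EXP_DEVCAP2_ATOMIC_COMP128)
          flags |= VFIO_PCI_ATOMIC_COMP128;
      return flags;
  }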
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Robin Voetter <robin@streamhpc.com>
Tested-by: Robin Voetter <robin@streamhpc.com>
Link: https://lore.kernel.org/r/20230519214748.402003-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Apply the same logic as commit 912b625b4d ("vfio/pci: demote hiding
ecap messages to debug level") for the less common case of hiding
standard capabilities.
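Illustratively, the demotion is a switch of print helper; the message text
here is a stand-in:

  /* was: pci_info(pdev, "hiding cap %#x@%#x\n", cap, pos); */
  pci_dbg(pdev, "hiding cap %#x@%#x\n", cap, pos);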
Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230523225250.1215911-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Dynamic MSI-X is now supported. Clear VFIO_IRQ_INFO_NORESIZE
to signal that support to user space.
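A sketch of the reporting side; has_dyn_msix is the per-device property
introduced earlier in this series:

  /* MSI-X vector space is resizable when dynamic allocation works */
  if (info.index == VFIO_PCI_MSIX_IRQ_INDEX && vdev->has_dyn_msix)
      info.flags &= ~VFIO_IRQ_INFO_NORESIZE;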
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/fd1ef2bf6ae972da8e2805bc95d5155af5a8fb0a.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
allocated after MSI-X enabling.
Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This differs from a range provided
during MSI-X enabling, where interrupts are allocated for the entire
range whether or not a valid eventfd is provided for each interrupt.
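A minimal sketch of the allocation step, assuming the device reported
dynamic MSI-X support; the function name and error handling are illustrative:

  static int example_alloc_vector(struct pci_dev *pdev, unsigned int vector)
  {
      struct msi_map map;

      /* allocate an irq for exactly this vector table index */
      map = pci_msix_alloc_irq_at(pdev, vector, NULL);
      if (map.index < 0)
          return map.index;   /* negative errno on failure */

      return map.virq;        /* Linux irq number for request_irq() */
  }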
The PCI MSI-X API requires that some number of irqs be allocated for
an initial set of vectors when enabling MSI-X on the device. When
dynamic MSI-X allocation is not supported, the vector table, and thus
the allocated irq set, can only be resized by disabling and re-enabling
MSI-X with a different range. In that case the irq allocation is
essentially a cache for configuring vectors within the previously
allocated vector range. When dynamic MSI-X allocation is supported,
the API still requires some initial set of irqs to be allocated, but
it also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.
For consistency between modes, for simplicity, and to reduce the latency
and improve the reliability of allocations, this implementation only
releases irqs via pci_free_irq_vectors() when either the interrupt
mode changes or the device is released.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/956c47057ae9fd45591feaa82e9ae20929889249.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Not all MSI-X devices support dynamic MSI-X allocation. Whether
a device supports dynamic MSI-X should be queried using
pci_msix_can_alloc_dyn().
Instead of scattering calls to pci_msix_can_alloc_dyn() throughout
the code, probe this ability once and store it as a property of the
virtual device.
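A one-line sketch of the probe, with the field name assumed:

  /* probe once at MSI-X setup; reused wherever the answer matters */
  vdev->has_dyn_msix = pci_msix_can_alloc_dyn(vdev->pdev);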
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/f1ae022c060ecb7e527f4f53c8ccafe80768da47.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation for a surrounding code change it is helpful to
ensure that existing comments are accurate.
Remove an inaccurate comment about direct access and update
the rest of the comment to reflect the purpose of writing
the cached MSI message to the device.
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/lkml/20230330164050.0069e2a5.alex.williamson@redhat.com/
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5b605ce7dcdab5a5dfef19cec4d73ae2fdad3ae1.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
struct vfio_pci_core_device::num_ctx counts how many interrupt
contexts have been allocated. When all interrupt contexts are
allocated simultaneously, num_ctx provides the upper bound of all
vectors that can be used as indices into the interrupt context
array.
With the upcoming support for dynamic MSI-X the number of
interrupt contexts does not necessarily span the range of allocated
interrupts. Consequently, num_ctx is no longer a trusted upper bound
for valid indices.
Stop using num_ctx to determine whether a provided vector is valid;
use the existence of an allocated interrupt instead.
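A sketch of the check, using the context-lookup helper from earlier in this
series (name assumed):

  struct vfio_pci_irq_ctx *ctx;

  /* valid only if an interrupt context exists for this vector */
  ctx = vfio_irq_ctx_get(vdev, vector);
  if (!ctx)
      return -EINVAL;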
This changes behavior on the error path when user space provides
an invalid vector range: instead of an early exit without any
modifications, valid vectors within the invalid range may now be
modified. This is acceptable considering that an invalid range is
not a valid scenario; see the link to the discussion.
The checks that ensure that user space provides a range of vectors
that is valid for the device are untouched.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230316155646.07ae266f.alex.williamson@redhat.com/
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/e27d350f02a65b8cbacd409b4321f5ce35b3186d.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Interrupt context is statically allocated at the time interrupts
are allocated. Following allocation, the context is managed by
directly accessing the elements of the array using the vector
as index. The storage is released when interrupts are disabled.
It is possible to dynamically allocate a single MSI-X interrupt
after MSI-X is enabled. Dynamic storage for interrupt context
is needed to support this. Replace the interrupt context array with an
xarray (similar to what the core uses as the store for MSI descriptors)
that can support dynamic expansion while maintaining the
convention of using the vector as the index.
With dynamic storage it is no longer necessary to pre-allocate
interrupt contexts at the time the interrupts are allocated.
MSI and MSI-X interrupt contexts are only used when interrupts are
enabled, so their allocation can be delayed until interrupt enabling.
Only enabled interrupts will have associated interrupt contexts.
Whether an interrupt has been allocated (a Linux irq number exists
for it) becomes the criterion for whether an interrupt can be enabled.
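A minimal sketch of an xarray-backed store, with struct and helper names
assumed for illustration:

  /* lookup: vector-indexed; returns NULL if no context exists */
  static struct vfio_pci_irq_ctx *
  vfio_irq_ctx_get(struct vfio_pci_core_device *vdev, unsigned long vector)
  {
      return xa_load(&vdev->ctx, vector);
  }

  /* allocate on enable; torn down again via xa_erase()/kfree() */
  static struct vfio_pci_irq_ctx *
  vfio_irq_ctx_alloc(struct vfio_pci_core_device *vdev, unsigned long vector)
  {
      struct vfio_pci_irq_ctx *ctx;

      ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
      if (!ctx)
          return NULL;

      if (xa_insert(&vdev->ctx, vector, ctx, GFP_KERNEL_ACCOUNT)) {
          kfree(ctx);
          return NULL;
      }
      return ctx;
  }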
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230404122444.59e36a99.alex.williamson@redhat.com/
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/40e235f38d427aff79ae35eda0ced42502aa0937.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Enabling and disabling of an interrupt involves several steps
that can fail. Cleanup after failure is done when the error
is encountered, resulting in some repetitive code.
Support for dynamic contexts will introduce more steps during
interrupt enabling and disabling.
Transition to a centralized exit path, in preparation for dynamic
contexts, to eliminate duplicate error handling code.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/72dddae8aa710ce522a74130120733af61cffe4d.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Interrupt context storage is statically allocated at the time
interrupts are allocated. Following allocation, the interrupt
context is managed by directly accessing the elements of the
array using the vector as index.
It is possible to allocate additional MSI-X vectors after
MSI-X has been enabled. Dynamic storage of interrupt context
is needed to support adding new MSI-X vectors after initial
allocation.
Replace direct access to array elements with pointers to the
array elements. Doing so reduces the impact of moving to a new data
structure. Move interactions with the array into helpers to
mostly contain the changes needed to transition to a dynamic
data structure.
No functional change intended.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/eab289693c8325ede9aba99380f8b8d5143980a4.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
User space provides the vector as an unsigned int that is checked
early for validity (vfio_set_irqs_validate_and_prepare()).
A later negative check of the provided vector is not necessary.
Remove the negative check and ensure the type used
for the vector is consistently unsigned int.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/28521e1b0b091849952b0ecb8c118729fc8cdc4f.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_msi_disable() releases all previously allocated state
associated with each interrupt before disabling MSI/MSI-X.
vfio_msi_disable() iterates twice over the interrupt state:
first directly with a for loop to do virqfd cleanup, followed
by another for loop within vfio_msi_set_block() that removes
the interrupt handler and its associated state using
vfio_msi_set_vector_signal().
Simplify interrupt cleanup by iterating over allocated interrupts
once.
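A hedged sketch of the single-pass teardown over the xarray-backed
contexts; field names are assumed:

  unsigned long i;
  struct vfio_pci_irq_ctx *ctx;

  /* one pass: virqfd cleanup plus handler/state teardown per vector */
  xa_for_each(&vdev->ctx, i, ctx) {
      vfio_virqfd_disable(&ctx->unmask);
      vfio_virqfd_disable(&ctx->mask);
      vfio_msi_set_vector_signal(vdev, i, -1, msix);
  }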
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/837acb8cbe86a258a50da05e56a1f17c1a19abbe.1683740667.git.reinette.chatre@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The Designated Vendor-Specific Extended Capability (DVSEC) is an
optional Extended Capability that may be implemented by any PCI
Express Function. It allows PCI Express component vendors to use
the Extended Capability mechanism to expose vendor-specific registers that
can be present in components from a variety of vendors. A DVSEC Capability
structure tells vendor-specific software which features a particular
component supports.
An example usage of DVSEC is Intel Platform Monitoring Technology (PMT) for
enumerating and accessing hardware monitoring capabilities on a device.
PMT encompasses three device monitoring features: Telemetry (device metrics),
Watcher (sampling/tracing), and Crashlog. The DVSEC, with the Intel vendor
code, is used to discover these features and provide a BAR offset to their
registers.
The current VFIO driver does not pass DVSEC capabilities to a Virtual Machine
(VM), which prevents PMT from working inside the VM. This series adds the
DVSEC capability to the user-visible list to allow its use with VFIO. VFIO
already supports passing the Vendor-Specific Extended Capability (VSEC) and
raw write access to the device; DVSEC is passed to the VM in the same way
as VSEC.
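A hedged sketch of enumerating a device's DVSEC instances with standard PCI
helpers; the field decoding follows the DVSEC header layout:

  static void dump_dvsec(struct pci_dev *pdev)
  {
      u16 pos = 0;

      while ((pos = pci_find_next_ext_capability(pdev, pos,
                                                 PCI_EXT_CAP_ID_DVSEC))) {
          u32 hdr1, hdr2;

          pci_read_config_dword(pdev, pos + PCI_DVSEC_HEADER1, &hdr1);
          pci_read_config_dword(pdev, pos + PCI_DVSEC_HEADER2, &hdr2);
          pci_dbg(pdev, "DVSEC vendor %#06x rev %u len %u id %#06x\n",
                  hdr1 & 0xffff, (hdr1 >> 16) & 0xf,
                  (hdr1 >> 20) & 0xfff, hdr2 & 0xffff);
      }
  }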
Signed-off-by: K V P Satyanarayana <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20230317082222.3355912-1-satyanarayana.k.v.p@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fix the report of dirty_bytes upon pre-copy to include both the existing
data on the migration file and the device's extra bytes.
This gives a closer estimate of how much more data can be transferred
as part of pre-copy.
Fixes: 0dce165b1a ("vfio/mlx5: Introduce vfio precopy ioctl implementation")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230308155723.108218-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Merge tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Remove redundant resource check in vfio-platform (Angus Chen)
- Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing
removal of arbitrary kernel limits in favor of cgroup control (Yishai
Hadas)
- mdev tidy-ups, including removing the module-only build restriction
for sample drivers, Kconfig changes to select mdev support,
documentation movement to keep sample driver usage instructions with
sample drivers rather than with API docs, remove references to
out-of-tree drivers in docs (Christoph Hellwig)
- Fix collateral breakages from mdev Kconfig changes (Arnd Bergmann)
- Make mlx5 migration support match device support, improve source and
target flows to improve pre-copy support and reduce downtime (Yishai
Hadas)
- Convert additional mdev sysfs case to use sysfs_emit() (Bo Liu)
- Resolve copy-paste error in mdev mbochs sample driver Kconfig (Ye
Xingchen)
- Avoid propagating missing reset error in vfio-platform if reset
requirement is relaxed by module option (Tomasz Duszynski)
- Range size fixes in mlx5 variant driver for missed last byte and
stricter range calculation (Yishai Hadas)
- Fixes to suspended vaddr support and locked_vm accounting, excluding
mdev configurations from the former due to potential to indefinitely
block kernel threads, fix underflow and restore locked_vm on new mm
(Steve Sistare)
- Update outdated vfio documentation due to new IOMMUFD interfaces in
recent kernels (Yi Liu)
- Resolve deadlock between group_lock and kvm_lock, finally (Matthew
Rosato)
- Fix NULL pointer in group initialization error path with IOMMUFD (Yan
Zhao)
* tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio: (32 commits)
vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd
docs: vfio: Update vfio.rst per latest interfaces
vfio: Update the kdoc for vfio_device_ops
vfio/mlx5: Fix range size calculation upon tracker creation
vfio: no need to pass kvm pointer during device open
vfio: fix deadlock between group lock and kvm lock
vfio: revert "iommu driver notify callback"
vfio/type1: revert "implement notify callback"
vfio/type1: revert "block on invalid vaddr"
vfio/type1: restore locked_vm
vfio/type1: track locked_vm per dma
vfio/type1: prevent underflow of locked_vm via exec()
vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR
vfio: platform: ignore missing reset if disabled at module init
vfio/mlx5: Improve the target side flow to reduce downtime
vfio/mlx5: Improve the source side flow upon pre_copy
vfio/mlx5: Check whether VF is migratable
samples: fix the prompt about SAMPLE_VFIO_MDEV_MBOCHS
vfio/mdev: Use sysfs_emit() to instead of sprintf()
vfio-mdev: add back CONFIG_VFIO dependency
...
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
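A minimal sketch of the conversion pattern, using the vm_flags helpers this
change introduces:

  /* was: vma->vm_flags |= VM_IO | VM_DONTEXPAND; */
  vm_flags_set(vma, VM_IO | VM_DONTEXPAND);

  /* was: vma->vm_flags &= ~VM_MAYWRITE; */
  vm_flags_clear(vma, VM_MAYWRITE);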
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix the range size calculation to include the last byte of each range.
In addition, round up the log of the total ranges' length to make the
calculation stricter.
Fixes: c1d050b0d1 ("vfio/mlx5: Create and destroy page tracker object")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230208152234.32370-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Improve the target-side flow to reduce downtime as follows:
- Support reading an optional record which includes the expected
stop_copy size.
- Once the source sends this record, which is expected to arrive as
part of the pre_copy flow, prepare data buffers that may be large
enough to hold the final stop_copy data.
The above reduces the migration downtime, as the data needed to load
the image is prepared ahead of time as part of pre_copy.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230124144955.139901-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Improve the source-side flow upon pre_copy as follows:
- Prepare the stop_copy buffers as part of moving to pre_copy.
- Send the target a record that includes the expected
stop_copy size to let it optimize its stop_copy flow as well.
To send the target this new record type (i.e.
MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE), split the current 64 header
flag bits into 32 flag bits and another 32 tag bits; each record
carries a tag and a flag indicating whether it is optional or
mandatory. Optional records are ignored by the target.
The above reduces the downtime upon stop_copy, as the relevant data
is prepared ahead of time as part of pre_copy.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230124144955.139901-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a check of whether the VF is migratable, and mark the VFIO device
as migration capable only if it is.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230124144955.139901-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use GFP_KERNEL_ACCOUNT for userspace-persistent allocations.
The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this
is an untrusted allocation triggered from userspace and should be
subject to kmem accounting; as such, it is controlled by the cgroup
mechanism.
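Illustratively, the conversion is a flag change at each allocation site
(sketch; names are placeholders):

  /* was: kzalloc(size, GFP_KERNEL); memory living for as long as
   * the userspace context now gets charged to its memory cgroup */
  buf = kzalloc(size, GFP_KERNEL_ACCOUNT);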
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use GFP_KERNEL_ACCOUNT for userspace-persistent allocations.
The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this
is an untrusted allocation triggered from userspace and should be
subject to kmem accounting; as such, it is controlled by the cgroup
mechanism.
The way to find the relevant allocations was, for example, to look at
the close_device function and trace back all the kfrees to their
allocations.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow loading images larger than 512 MB by dropping today's arbitrary
hard-coded limit and moving to the maximum device loading value,
which is for now 4 GB.
As part of that, move to the GFP_KERNEL_ACCOUNT option when
allocating the persistent data of mlx5 and rely on the cgroup to provide
the memory limit for the given user.
The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this
is an untrusted allocation triggered from userspace and should be
subject to kmem accounting; as such, it is controlled by the cgroup
mechanism.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Prevent calling roundup_pow_of_two() with a value of 0, as it causes the
UBSAN splat below.
Move this code and its few related lines so that they are called only
when really applicable.
UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 15 PID: 1639 Comm: live_migration Not tainted 6.1.0-rc4 #1116
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
ubsan_epilogue+0x5/0x36
__ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
? lock_is_held_type+0x98/0x110
? rcu_read_lock_sched_held+0x3f/0x70
mlx5vf_create_rc_qp.cold+0xe4/0xf2 [mlx5_vfio_pci]
mlx5vf_start_page_tracker+0x769/0xcd0 [mlx5_vfio_pci]
vfio_device_fops_unl_ioctl+0x63f/0x700 [vfio]
__x64_sys_ioctl+0x433/0x9a0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
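A guard along these lines avoids the undefined shift (sketch; the variable
names are illustrative):

  /* roundup_pow_of_two(0) shifts by the full word width, which is
   * undefined behavior; only round up non-zero values */
  if (npages)
      rq_size = roundup_pow_of_two(npages);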
Fixes: 79c3cf2799 ("vfio/mlx5: Init QP based resources for dirty tracking")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Replace deprecated git://github.com link in MAINTAINERS (Palmer
Dabbelt)
- Simplify vfio/mlx5 with module_pci_driver() helper (Shang XiaoJing)
- Drop unnecessary buffer from ACPI call (Rafael Mendonca)
- Correct latent missing include issue in iova-bitmap and fix support
for unaligned bitmaps. Follow-up with better fix through refactor
(Joao Martins)
- Rework ccw mdev driver to split private data from parent structure,
better aligning with the mdev lifecycle and allowing us to remove a
temporary workaround (Eric Farman)
- Add an interface to get an estimated migration data size for a
device, allowing userspace to make informed decisions, ex. more
accurately predicting VM downtime (Yishai Hadas)
- Fix minor typo in vfio/mlx5 array declaration (Yishai Hadas)
- Simplify module and Kconfig through consolidating SPAPR/EEH code and
config options and folding virqfd module into main vfio module (Jason
Gunthorpe)
- Fix error path from device_register() across all vfio mdev and sample
drivers (Alex Williamson)
- Define migration pre-copy interface and implement for vfio/mlx5
devices, allowing portions of the device state to be saved while the
device continues operation, towards reducing the stop-copy state size
(Jason Gunthorpe, Yishai Hadas, Shay Drory)
- Implement pre-copy for hisi_acc devices (Shameer Kolothum)
- Fixes to mdpy mdev driver remove path and error path on probe (Shang
XiaoJing)
- vfio/mlx5 fixes for incorrect return after copy_to_user() fault and
incorrect buffer freeing (Dan Carpenter)
* tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio: (42 commits)
vfio/mlx5: error pointer dereference in error handling
vfio/mlx5: fix error code in mlx5vf_precopy_ioctl()
samples: vfio-mdev: Fix missing pci_disable_device() in mdpy_fb_probe()
hisi_acc_vfio_pci: Enable PRE_COPY flag
hisi_acc_vfio_pci: Move the dev compatibility tests for early check
hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions
hisi_acc_vfio_pci: Add support for precopy IOCTL
vfio/mlx5: Enable MIGRATION_PRE_COPY flag
vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error
vfio/mlx5: Introduce multiple loads
vfio/mlx5: Consider temporary end of stream as part of PRE_COPY
vfio/mlx5: Introduce vfio precopy ioctl implementation
vfio/mlx5: Introduce SW headers for migration states
vfio/mlx5: Introduce device transitions of PRE_COPY
vfio/mlx5: Refactor to use queue based data chunks
vfio/mlx5: Refactor migration file state
vfio/mlx5: Refactor MKEY usage
vfio/mlx5: Refactor PD usage
vfio/mlx5: Enforce a single SAVE command at a time
vfio: Extend the device migration protocol with PRE_COPY
...
This code frees the wrong "buf" variable and results in an error pointer
dereference.
Fixes: 34e2f27143 ("vfio/mlx5: Introduce multiple loads")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/Y5IKia5SaiVxYmG5@kili
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The copy_to_user() function returns the number of bytes remaining to
be copied but we want to return a negative error code here.
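The usual pattern (sketch):

  if (copy_to_user(argp, &info, sizeof(info)))
      return -EFAULT;   /* not the non-zero remainder */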
Fixes: 0dce165b1a ("vfio/mlx5: Introduce vfio precopy ioctl implementation")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/Y5IKVknlf5Z5NPtU@kili
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Instead of waiting until the data transfer is complete to perform the
device compatibility check, do it as soon as we have enough data for
the check. This will be useful when we enable support for PRE_COPY.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20221123113236.896-4-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The saving migration file (saving_migf) is opened in the PRE_COPY state,
if supported, and reads the initial device match data.
hisi_acc_vf_stop_copy() is refactored to make use of common code.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20221123113236.896-3-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The PRECOPY ioctl, in the case of the HiSilicon ACC driver, can be used
to perform the device compatibility check earlier during migration.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20221123113236.896-2-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that everything has been set up for MIGRATION_PRE_COPY, enable it.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-15-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Before a SAVE command is issued, a QUERY command is issued in order to
know the device data size.
In case PRE_COPY is used, the above commands are issued while the device
is running. Thus, it is possible that between the QUERY and the SAVE
commands the state of the device changes significantly, causing the
SAVE to fail.
Currently, if a SAVE command fails, the driver fails the migration.
In the above case, don't fail the migration, but don't allow new SAVEs
to be executed while the device is in the RUNNING state.
Once the device is moved to STOP_COPY, SAVE can be executed again and
the full device state will be read.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-14-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In order to support PRE_COPY, the mlx5 driver transfers multiple states
(images) of the device, e.g. the source VF can save and transfer
multiple states, and the target VF will load them in that order.
This patch implements the changes for the target VF to decompose the
header of each state and to write and load multiple states.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-13-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
During PRE_COPY the migration data FD may have a temporary "end of
stream" that is reached when the initial_bytes were read and no other
dirty data exists yet.
For instance, this may indicate that the device is idle and not
currently dirtying any internal state. When read() is done on this
temporary end of stream, the kernel driver should return ENOMSG from
read(). Userspace can wait for more data or consider moving to
STOP_COPY.
To not block the user upon read() and let it get ENOMSG we add a new
state named MLX5_MIGF_STATE_PRE_COPY on the migration file.
In addition, we add the MLX5_MIGF_STATE_SAVE_LAST state to block the
read() once we call the last SAVE upon moving to STOP_COPY.
Any further error will be marked with MLX5_MIGF_STATE_ERROR and the user
won't be blocked.
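A hedged sketch of the read()-side check; the state names come from this
change, while data_available is an illustrative placeholder:

  /* temporary end of stream during PRE_COPY: nothing to read now,
   * but more may follow; tell userspace via ENOMSG */
  if (migf->state == MLX5_MIGF_STATE_PRE_COPY && !data_available)
      return -ENOMSG;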
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-12-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio precopy ioctl returns an estimation of the data available for
transfer from the device.
Whenever a user issues VFIO_MIG_GET_PRECOPY_INFO, track the current
state of the device and, if needed, append the dirty data to the
transfer FD data. This is done by saving a middle state.
As mlx5 runs the SAVE command asynchronously, make sure to query for
incremental data only once there is no active SAVE command.
Running both in parallel might end up with a failure of the incremental
query command on an untracked vhca.
Also, a middle state will be saved only after the previous state has
finished its SAVE command and has been fully transferred; this prevents
endless use of resources.
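From userspace, the query is a minimal ioctl on the migration data FD
(sketch; plan_transfer() is a hypothetical consumer):

  struct vfio_precopy_info info = { .argsz = sizeof(info) };

  if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info) == 0)
      /* initial_bytes and dirty_bytes estimate the data currently
       * available in the PRE_COPY phase */
      plan_transfer(info.initial_bytes + info.dirty_bytes);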
Co-developed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-11-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As mentioned in the previous patches, mlx5 transfers multiple states
when the PRE_COPY protocol is used. This state mechanism requires the
target VM to know the states' sizes in order to execute multiple loads.
Therefore, add a SW header, with the needed information, to each saved
state the source VM transfers to the target VM.
This patch implements the source VM handling of the headers; a following
patch will implement the target VM handling of the headers.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In order to support PRE_COPY, the mlx5 driver transfers multiple
states (images) of the device, e.g. the source VF can save and transfer
multiple states, and the target VF will load them in that order.
The device saves three kinds of states:
1) Initial state - when the device moves to the PRE_COPY state.
2) Middle states - during the PRE_COPY phase via VFIO_MIG_GET_PRECOPY_INFO.
There can be multiple states of this kind.
3) Final state - when the device moves to the STOP_COPY state.
After moving to the PRE_COPY state, the user holds the saving migf FD and
can use it. For example, the user can start transferring data via the
read() callback. The user can also switch from PRE_COPY to STOP_COPY
whenever it sees fit; this will invoke saving of the final state.
This means the mlx5 VFIO device can be switched to STOP_COPY without
transferring any data in the PRE_COPY state. Therefore, when the device
moves to STOP_COPY, mlx5 will store the final state in a dedicated queue
entry on the list.
Co-developed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Refactor to use queue-based data chunks on the migration file.
The SAVE command adds a chunk to the tail of the queue while the read()
API finds the required chunk and returns its data.
In case the queue is empty but the state of the migration file is
MLX5_MIGF_STATE_COMPLETE, read() will not block but will return 0 to
indicate end of file.
This is a step towards maintaining multiple images and their metadata
(i.e. headers) on the migration file as part of the next patches in the
series.
Note: at this point we still use a single chunk on the migration file,
but the code becomes ready to support multiple chunks.
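A hedged sketch of the producer side, with list and lock field names
assumed for illustration:

  /* SAVE completion: queue the finished chunk for read() to find */
  spin_lock_irq(&migf->list_lock);
  list_add_tail(&buf->buf_elm, &migf->buf_list);
  spin_unlock_irq(&migf->list_lock);
  wake_up_interruptible(&migf->poll_wait);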
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Refactor the migration file state to be an enum whose values are
mutually exclusive.
As part of that, drop the 'disabled' state, as 'error' is the same from
a functional point of view.
The next patches in the series will extend this enum with other relevant
states.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch refactors MKEY usage such that its life cycle matches that of
the migration file, instead of allocating/destroying it upon each
SAVE/LOAD command.
This is a preparation step towards the PRE_COPY series where multiple
images will be SAVED/LOADED.
We achieve this with a new struct named mlx5_vhca_data_buffer, which
holds the MKEY and its related fields, such as sg_append_table,
allocated_length, etc.
These fields were moved out of the migration file's main struct into the
dedicated mlx5_vhca_data_buffer struct, with the proper helpers in
place.
For now we have a single mlx5_vhca_data_buffer per migration file.
However, in coming patches we'll have multiple of them to support
multiple images.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch refactors PD usage such that its life cycle matches that of
the migration file, instead of allocating/destroying it upon each
SAVE/LOAD command.
This is a preparation step towards the PRE_COPY series where multiple
images will be SAVED/LOADED and a single PD can simply be reused.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Enforce a single SAVE command at a time.
As the SAVE command is asynchronous, we must enforce running only
a single command at a time.
This preserves ordering between multiple calls and protects against
races on the migration file data structures.
This is a must for the next patches in the series, where as part of
PRE_COPY we may have multiple images to be saved and multiple SAVE
commands may be issued from different flows.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We don't need a Kconfig symbol for this; just test CONFIG_EEH directly in
the few places that need it.
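Illustratively, the pattern becomes (sketch; assumes a no-op stub for
eeh_dev_open() exists when CONFIG_EEH is disabled):

  /* compile-time test replaces the dedicated Kconfig symbol */
  if (IS_ENABLED(CONFIG_EEH))
      eeh_dev_open(vdev->pdev);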
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_spapr_pci_eeh_open/release() functions are one-line wrappers
around an arch function. Just call the arch function directly. This
eliminates some weird exported symbols that don't need to exist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/1-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommufd version.
The rc fix was done a different way when the iommufd patches reworked this
code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This creates the iommufd_device for the physical VFIO drivers. These are
all the drivers that call vfio_register_group_dev() and expect the
type1 code to set up a real iommu_domain against their parent struct
device.
The design gives the driver a choice in how it gets connected to iommufd
by providing bind_iommufd/unbind_iommufd/attach_ioas callbacks to
implement as required. The core code provides three default callbacks for
physical mode using a real iommu_domain. This is suitable for drivers
using vfio_register_group_dev().
Link: https://lore.kernel.org/r/6-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yu He <yu.he@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Fix a typo in mlx5vf_cmd_load_vhca_state() to use the 'load' memory
layout.
As the in/out sizes are equal for the save and load commands, there wasn't
any functional issue.
Fixes: f1d98f346e ("vfio/mlx5: Expose migration commands over mlx5 device")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221106174630.25909-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>