The internal mechanisms support this, but instead of exposing the gfp to
the caller it is wrapped up inside iommu_map() and iommu_map_atomic().
Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.
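For illustration, a minimal sketch of where this leads, assuming the gfp
simply becomes an explicit parameter of iommu_map() (signature
illustrative, not necessarily the final form):

	int iommu_map(struct iommu_domain *domain, unsigned long iova,
		      phys_addr_t paddr, size_t size, int prot, gfp_t gfp);

	/* callers that need memcg accounting can then pass it directly: */
	rc = iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL_ACCOUNT);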
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: https://lore.kernel.org/r/1-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
For device tree nodes, use the standard of_iommu_get_resv_regions()
implementation to obtain the reserved memory regions associated with a
device.
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: devicetree@vger.kernel.org
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20230120174251.4004100-5-thierry.reding@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is an implementation that IOMMU drivers can use to obtain reserved
memory regions from a device tree node. It uses the reserved-memory DT
bindings to find the regions associated with a given device. If these
regions are marked accordingly, identity mappings will be created for
them in the IOMMU domain that the devices will be attached to.
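A rough sketch of how a driver would use it, assuming the helper takes a
device and a list head like iommu_dma_get_resv_regions() does (the driver
callback name is hypothetical):

	#include <linux/of_iommu.h>

	static void foo_iommu_get_resv_regions(struct device *dev,
					       struct list_head *head)
	{
		of_iommu_get_resv_regions(dev, head);	/* DT reserved-memory regions */
		iommu_dma_get_resv_regions(dev, head);	/* generic regions */
	}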
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: devicetree@vger.kernel.org
Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20230120174251.4004100-4-thierry.reding@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The Qualcomm display driver installs a translation domain once it has
mapped a framebuffer. Use the identity domain for this device on
SC8280XP as well, to avoid faults from EFI FB accessing the framebuffer
while this is being set up.
Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Link: https://lore.kernel.org/r/20230113041104.4189152-1-quic_bjorande@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Add the SM8150 DPU compatible to the clients compatible list, as it also
needs the workarounds.
Signed-off-by: Marijn Suijten <marijn.suijten@somainline.org>
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20221212121054.193059-1-konrad.dybcio@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
struct iommu_ioas_copy, struct iommu_option and struct iommu_vfio_ioas are
missing from ucmd_buffer. Although they are smaller than the size of
ucmd_buffer, it is safer to list them in ucmd_buffer explicitly.
Fixes: aad37e71d5 ("iommufd: IOCTLs for the io_pagetable")
Fixes: d624d6652a ("iommufd: vfio container FD ioctl compatibility")
Link: https://lore.kernel.org/r/20230120122040.280219-1-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If platform_driver_register() fails, there is no need to unregister; just
call kmem_cache_free() to release the memory allocated before the
registration.
Fixes: bbc4d205d9 ("iommu/exynos: Fix driver initialization sequence")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Link: https://lore.kernel.org/r/20230104095702.2591122-1-yangyingliang@huawei.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current code clears event log and PPR log entries after processing
them, due to hardware errata ([1] erratum #732, #733). SNP-enabled systems
do not have this hardware issue.
When SNP is enabled, the event log, PPR log and completion wait buffer are
read-only to the host (see the SNP FW ABI spec [2]). Clearing those entries
will result in a kernel #PF for an RMP violation. Hence do not clear event
and PPR log entries after processing them.
[1] http://developer.amd.com/wordpress/media/2012/10/48931_15h_Mod_10h-1Fh_Rev_Guide.pdf
[2] https://www.amd.com/system/files/TechDocs/56860.pdf
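A sketch of the resulting logic, assuming the existing clear is simply
guarded by the driver's global SNP flag (the exact hunk is illustrative):

	/* the entry write-back faults with an RMP violation under SNP */
	if (!amd_iommu_snp_en)
		memset(__evt, 0, 4 * sizeof(u32));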
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20230117044038.5728-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Now that we have the driver properly parameterized, we can add support
for T8110 DARTs. These DARTs drop the multiple TTBRs (which only make
sense with legacy 4K page platforms) and instead add support for new
features and more stream IDs. The register layout is different, but the
pagetable format is the same as T6000.
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-8-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
T8110 has a new register layout. To accommodate this, first move all the
register offsets to the hw structure, and rename all the existing
registers to DART_T8020_*.
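Conceptually (a sketch; field names and offset values here are
hypothetical), the per-SoC hw description now carries the layout:

	struct apple_dart_hw {
		u32 tcr;	/* offset of the per-stream TCR registers */
		u32 ttbr;	/* offset of the per-stream TTBR registers */
	};

	static const struct apple_dart_hw apple_dart_hw_t8020 = {
		.tcr  = 0x100,
		.ttbr = 0x200,
	};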
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-7-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
They didn't have the PARAMS reg index in them, but they should.
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-6-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
T8110 only has one TTBR per stream, so un-hardcode that.
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-5-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
T8110 DARTs have up to 256 SIDs, so we need to switch to a bitmap to
handle them properly.
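A sketch of the idea, using the standard kernel bitmap helpers (variable
and helper names hypothetical):

	#define DART_MAX_STREAMS 256

	DECLARE_BITMAP(sids, DART_MAX_STREAMS);
	unsigned int sid;

	for_each_set_bit(sid, sids, DART_MAX_STREAMS)
		apple_dart_setup_stream(dart, sid);	/* hypothetical helper */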
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-4-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We need to save/restore the TCR/TTBR registers, since they are lost
on power gate.
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20230113105029.26654-3-marcan@marcan.st
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The struct initializer for set_platform_dma_ops uses a semicolon as
separator where a comma is required. Fix the compile error by using the
correct separator.
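That is (the callback name is hypothetical):

	.set_platform_dma_ops = foo_set_platform_dma;	/* broken: ';' terminates the initializer */
	.set_platform_dma_ops = foo_set_platform_dma,	/* fixed: ',' separates the members */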
Fixes: c1fe9119ee ("iommu: Add set_platform_dma_ops callbacks")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230113191528.23638-1-joro@8bytes.org
The function is unused after commit 1b932ceddd ("iommu:
Remove detach_dev callbacks") and so compilation fails with
drivers/iommu/ipmmu-vmsa.c:305:13: error: ‘ipmmu_utlb_disable’ defined but not used [-Werror=unused-function]
305 | static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
| ^~~~~~~~~~~~~~~~~~
Remove the function to fix the compile error.
Fixes: 1b932ceddd ("iommu: Remove detach_dev callbacks")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230113185640.8050-1-joro@8bytes.org
In passthrough mode we do not use an IOMMU page table, hence we don't need
to allocate io_pgtable_ops.
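A sketch of the control flow (illustrative, not the exact hunk):

	if (type == IOMMU_DOMAIN_IDENTITY)
		return domain;	/* passthrough: no IOMMU page table needed */

	pgtbl_ops = alloc_io_pgtable_ops(fmt, &cfg, cookie);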
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20230105091728.42469-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Some io-pgtable implementations, and thus their users too, carry a
slightly odd dependency to get around the GENERIC_ATOMIC64 version of
cmpxchg64() often failing to compile. Since this is a functional
dependency, it's a bit misleading and untidy to tie it explicitly to
COMPILE_TEST while assuming that it's also implied by the other
platform/architecture options. Make things clearer by separating these
functional dependencies into distinct statements from those controlling
visibility, and since they do look a bit non-obvious to the uninitiated,
also commenting them for good measure.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/51d8c78e2ecc6696ac5907526580209ea6da167f.1673553587.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The detach_dev callback of domain ops is not called in the IOMMU core.
Remove this callback to avoid dead code. The trace event for detaching
domain from device is removed accordingly.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230110025408.667767-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently, __iommu_detach_device() is only called via call chains that run
after the device driver is attached, e.g. via explicit attach APIs called
by the device driver.
Commit bd421264ed ("iommu: Fix deferred domain attachment") has removed
deferred domain attachment check from __iommu_attach_device() path, so it
should just unconditionally work in the __iommu_detach_device() path.
It actually looks like a bug that we were blocking detach on these paths,
since the attach was unconditional and the caller is going to free the
(probably) UNMANAGED domain once this returns.
The only place we should be testing for deferred attach is at the initial
point where the DMA device is linked to the group, and then again during
the DMA API calls.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230110025408.667767-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
For those IOMMU drivers that don't provide default domain support, add an
implementation of the set_platform_dma_ops callback so that the IOMMU core
can return DMA control to the platform DMA ops. At the same time, with
set_platform_dma_ops implemented, there is no need for detach_dev. Remove
it to avoid dead code.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230110025408.667767-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When VFIO finishes assigning a device to user space and calls
iommu_group_release_dma_owner() to return the device to the kernel, the
IOMMU core will attach the default domain to the device. Unfortunately,
some IOMMU drivers don't support default domains, so in the end the core
calls .detach_dev instead.
This adds a set_platform_dma_ops iommu op to make it clear that what it
does is return control back to the platform DMA ops.
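The shape of the new op, as a sketch inferred from this description:

	/* in struct iommu_ops */
	void (*set_platform_dma_ops)(struct device *dev);

A driver without default domain support implements it to hand the device
back to the platform/arch DMA ops instead of relying on .detach_dev.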
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230110025408.667767-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu core calls the driver's detach_dev domain op callback only when a
device has finished being assigned to user space and
iommu_group_release_dma_owner() is called to return the device to the
kernel, at which point the iommu core wants to set the default domain for
the device but the driver didn't provide one.
In other words, if an iommu driver provides default domain support, the
.detach_dev callback will never be called. This removes the detach_dev
callbacks in those IOMMU drivers that support default domains.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev> # apple-dart
Acked-by: Chunyan Zhang <zhang.lyra@gmail.com> # sprd
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> # amd
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230110025408.667767-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A clk, prepared and enabled in mtk_iommu_v1_hw_init(), is not released in
the error handling path of mtk_iommu_v1_probe().
Add the corresponding clk_disable_unprepare(), as already done in the
remove function.
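A sketch of the unwind (label and surrounding code illustrative):

	ret = iommu_device_register(&data->iommu, ...);
	if (ret)
		goto out_clk_unprepare;
	return 0;

out_clk_unprepare:
	clk_disable_unprepare(data->bclk);	/* undo mtk_iommu_v1_hw_init() */
	return ret;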
Fixes: b17336c55d ("iommu/mediatek: add support for mtk iommu generation one HW")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/593e7b7d97c6e064b29716b091a9d4fd122241fb.1671473163.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In __alloc_and_insert_iova_range(), retry_pfn can overflow. The value of
iovad->anchor.pfn_hi is ~0UL, so when iovad->cached_node is iovad->anchor,
curr_iova->pfn_hi + 1 overflows. As a result, if the retry logic is
executed, low_pfn is updated to 0, and then new_pfn < low_pfn returns
false, making the allocation incorrectly succeed.
This issue occurs in the following two situations:
1. The first iova allocation exceeds the domain size. When the iova
domain is initialized, iovad->cached_node is assigned iovad->anchor. For
example, the iova domain size is 10M, start_pfn is 0x1_F000_0000, and
the iova size allocated the first time is 11M. In the log below,
new->pfn_lo is smaller than iovad->cached_node:
[ 223.798112][T1705487] sh: [name:iova&]__alloc_and_insert_iova_range
start_pfn:0x1f0000,retry_pfn:0x0,size:0xb00,limit_pfn:0x1f0a00
[ 223.799590][T1705487] sh: [name:iova&]__alloc_and_insert_iova_range
success start_pfn:0x1f0000,new->pfn_lo:0x1efe00,new->pfn_hi:0x1f08ff
2. The node with the largest iova->pfn_lo value in the iova domain is
deleted; iovad->cached_node is then updated to iovad->anchor, and the
requested iova size exceeds the maximum iova size that can be allocated
in the domain.
Fix the overflow by checking that retry_pfn is less than limit_pfn before
adding 1 to it.
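A sketch of the overflow and the guard (illustrative of the idea, not the
exact hunk):

	retry_pfn = curr_iova->pfn_hi + 1;	/* wraps to 0 when pfn_hi == ~0UL */

	if (curr_iova->pfn_hi < limit_pfn)	/* guard first, then step past it */
		retry_pfn = curr_iova->pfn_hi + 1;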
Signed-off-by: jianjiao zeng <jianjiao.zeng@mediatek.com>
Signed-off-by: Yunfei Wang <yf.wang@mediatek.com>
Cc: <stable@vger.kernel.org> # 5.15.*
Fixes: 4e89dce725 ("iommu/iova: Retry from last rb tree node if iova search fails")
Acked-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20230111063801.25107-1-yf.wang@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Similar to SMMUv2, this driver calls iommu_device_unregister() from the
shutdown path, which removes the IOMMU groups with no coordination
whatsoever with their users - shutdown methods are optional in device
drivers. This can lead to NULL pointer dereferences in those drivers'
DMA API calls, or worse.
Instead of calling the full arm_smmu_device_remove() from
arm_smmu_device_shutdown(), let's pick only the relevant function call -
arm_smmu_device_disable() - more or less the reverse of
arm_smmu_device_reset() - and call just that from the shutdown path.
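Roughly (a sketch of the split; details illustrative):

	static void arm_smmu_device_shutdown(struct platform_device *pdev)
	{
		struct arm_smmu_device *smmu = platform_get_drvdata(pdev);

		/* quiesce DMA only; keep the IOMMU and its groups registered */
		arm_smmu_device_disable(smmu);
	}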
Fixes: 57365a04c9 ("iommu: Move bus setup to IOMMU device registration")
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20221215141251.3688780-2-vladimir.oltean@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
Michael Walle says he noticed the following stack trace while performing
a shutdown with "reboot -f". He suggests he got "lucky" and just hit the
correct spot for the reboot while there was a packet transmission in
flight.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098
CPU: 0 PID: 23 Comm: kworker/0:1 Not tainted 6.1.0-rc5-00088-gf3600ff8e322 #1930
Hardware name: Kontron KBox A-230-LS (DT)
pc : iommu_get_dma_domain+0x14/0x20
lr : iommu_dma_map_page+0x9c/0x254
Call trace:
iommu_get_dma_domain+0x14/0x20
dma_map_page_attrs+0x1ec/0x250
enetc_start_xmit+0x14c/0x10b0
enetc_xmit+0x60/0xdc
dev_hard_start_xmit+0xb8/0x210
sch_direct_xmit+0x11c/0x420
__dev_queue_xmit+0x354/0xb20
ip6_finish_output2+0x280/0x5b0
__ip6_finish_output+0x15c/0x270
ip6_output+0x78/0x15c
NF_HOOK.constprop.0+0x50/0xd0
mld_sendpack+0x1bc/0x320
mld_ifc_work+0x1d8/0x4dc
process_one_work+0x1e8/0x460
worker_thread+0x178/0x534
kthread+0xe0/0xe4
ret_from_fork+0x10/0x20
Code: d503201f f9416800 d503233f d50323bf (f9404c00)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception in interrupt
This appears to be reproducible when the board has a fixed IP address,
is ping flooded from another host, and "reboot -f" is used.
The following is one more manifestation of the issue:
$ reboot -f
kvm: exiting hardware virtualization
cfg80211: failed to load regulatory.db
arm-smmu 5000000.iommu: disabling translation
sdhci-esdhc 2140000.mmc: Removing from iommu group 11
sdhci-esdhc 2150000.mmc: Removing from iommu group 12
fsl-edma 22c0000.dma-controller: Removing from iommu group 17
dwc3 3100000.usb: Removing from iommu group 9
dwc3 3110000.usb: Removing from iommu group 10
ahci-qoriq 3200000.sata: Removing from iommu group 2
fsl-qdma 8380000.dma-controller: Removing from iommu group 20
platform f080000.display: Removing from iommu group 0
etnaviv-gpu f0c0000.gpu: Removing from iommu group 1
etnaviv etnaviv: Removing from iommu group 1
caam_jr 8010000.jr: Removing from iommu group 13
caam_jr 8020000.jr: Removing from iommu group 14
caam_jr 8030000.jr: Removing from iommu group 15
caam_jr 8040000.jr: Removing from iommu group 16
fsl_enetc 0000:00:00.0: Removing from iommu group 4
arm-smmu 5000000.iommu: Blocked unknown Stream ID 0x429; boot with "arm-smmu.disable_bypass=0" to allow, but this may have security implications
arm-smmu 5000000.iommu: GFSR 0x80000002, GFSYNR0 0x00000002, GFSYNR1 0x00000429, GFSYNR2 0x00000000
fsl_enetc 0000:00:00.1: Removing from iommu group 5
arm-smmu 5000000.iommu: Blocked unknown Stream ID 0x429; boot with "arm-smmu.disable_bypass=0" to allow, but this may have security implications
arm-smmu 5000000.iommu: GFSR 0x80000002, GFSYNR0 0x00000002, GFSYNR1 0x00000429, GFSYNR2 0x00000000
arm-smmu 5000000.iommu: Blocked unknown Stream ID 0x429; boot with "arm-smmu.disable_bypass=0" to allow, but this may have security implications
arm-smmu 5000000.iommu: GFSR 0x80000002, GFSYNR0 0x00000000, GFSYNR1 0x00000429, GFSYNR2 0x00000000
fsl_enetc 0000:00:00.2: Removing from iommu group 6
fsl_enetc_mdio 0000:00:00.3: Removing from iommu group 8
mscc_felix 0000:00:00.5: Removing from iommu group 3
fsl_enetc 0000:00:00.6: Removing from iommu group 7
pcieport 0001:00:00.0: Removing from iommu group 18
arm-smmu 5000000.iommu: Blocked unknown Stream ID 0x429; boot with "arm-smmu.disable_bypass=0" to allow, but this may have security implications
arm-smmu 5000000.iommu: GFSR 0x00000002, GFSYNR0 0x00000000, GFSYNR1 0x00000429, GFSYNR2 0x00000000
pcieport 0002:00:00.0: Removing from iommu group 19
Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a8
pc : iommu_get_dma_domain+0x14/0x20
lr : iommu_dma_unmap_page+0x38/0xe0
Call trace:
iommu_get_dma_domain+0x14/0x20
dma_unmap_page_attrs+0x38/0x1d0
enetc_unmap_tx_buff.isra.0+0x6c/0x80
enetc_poll+0x170/0x910
__napi_poll+0x40/0x1e0
net_rx_action+0x164/0x37c
__do_softirq+0x128/0x368
run_ksoftirqd+0x68/0x90
smpboot_thread_fn+0x14c/0x190
Code: d503201f f9416800 d503233f d50323bf (f9405400)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception in interrupt
---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
The problem seems to be that iommu_group_remove_device() is allowed to
run with no coordination whatsoever with the shutdown procedure of the
enetc PCI device. In fact, it almost seems to imply that the
pci_driver::shutdown() method is mandatory if DMA is used with an
IOMMU, otherwise this is inevitable. That was never the case; shutdown
methods are optional in device drivers.
This is the call stack that leads to iommu_group_remove_device() during
reboot:
kernel_restart
-> device_shutdown
-> platform_shutdown
-> arm_smmu_device_shutdown
-> arm_smmu_device_remove
-> iommu_device_unregister
-> bus_for_each_dev
-> remove_iommu_group
-> iommu_release_device
-> iommu_group_remove_device
I don't know much about the arm_smmu driver, but
arm_smmu_device_shutdown() invoking arm_smmu_device_remove() looks
suspicious, since it causes the IOMMU device to unregister and that's
where everything starts to unravel. It forces all other devices which
depend on IOMMU groups to also point their ->shutdown() to ->remove(),
which will make reboot slower overall.
There are two commits relevant to this behavior. The first was commit
b06c076ea9 ("Revert "iommu/arm-smmu: Make arm-smmu explicitly
non-modular"") when arm_smmu_device_shutdown() was made to run the exact
same thing as arm_smmu_device_remove(). Prior to that, there was no
iommu_device_unregister() call in arm_smmu_device_shutdown(). However,
that was benign until commit 57365a04c9 ("iommu: Move bus setup to
IOMMU device registration"), which made iommu_device_unregister() call
remove_iommu_group().
Restore the old shutdown behavior by making remove() call shutdown(),
but shutdown() does not call the remove() specific bits.
Fixes: 57365a04c9 ("iommu: Move bus setup to IOMMU device registration")
Reported-by: Michael Walle <michael@walle.cc>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20221215141251.3688780-1-vladimir.oltean@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
Although it's vanishingly unlikely that anyone would integrate an SMMU
within a coherent interconnect without also making the pagetable walk
interface coherent, the same effect happens if a coherent SMMU fails to
advertise CTTW correctly. This turns out to be the case on some popular
NXP SoCs, where VFIO started failing the IOMMU_CAP_CACHE_COHERENCY test,
even though IOMMU_CACHE *was* previously achieving the desired effect
anyway thanks to the underlying integration.
While those SoCs stand to gain some more general benefits from a
firmware update to override CTTW correctly in DT/ACPI, it's also easy
to work around this in Linux as well, to avoid imposing too much on
affected users - since the upstream client devices *are* correctly
marked as coherent, we can trivially infer their coherent paths through
the SMMU as well.
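A sketch of the capability check this implies (the exact condition is
illustrative; device_get_dma_attr() reports the firmware-described
coherency of the client device):

	case IOMMU_CAP_CACHE_COHERENCY:
		return cfg->smmu->features & ARM_SMMU_FEAT_COHERENT_WALK ||
		       device_get_dma_attr(dev) == DEV_DMA_COHERENT;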
Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Fixes: df198b37e7 ("iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY better")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/d6dc41952961e5c7b21acac08a8bf1eb0f69e124.1671123115.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
No iommu driver implements this any more, get rid of it.
Link: https://lore.kernel.org/r/9-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
s390 doesn't use irq_domains, so it has no place to set
IRQ_DOMAIN_FLAG_ISOLATED_MSI. Instead of continuing to abuse the iommu
subsystem to convey this information, add a simple define which s390 can
make statically true. The define will cause msi_device_has_isolated_msi()
to return true.
Remove IOMMU_CAP_INTR_REMAP from the s390 iommu driver.
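The define is trivial (sketch; the exact header location is assumed):

	/* in an s390 arch header */
	#define arch_is_isolated_msi()	true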
Link: https://lore.kernel.org/r/8-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
On x86 platforms when the HW can support interrupt remapping the iommu
driver creates an irq_domain for the IR hardware and creates a child MSI
irq_domain.
When the global irq_remapping_enabled is set, the IR MSI domain is
assigned to the PCI devices (by intel_irq_remap_add_device(), or
amd_iommu_set_pci_msi_domain()) making those devices have the isolated MSI
property.
Due to how interrupt domains work, setting IRQ_DOMAIN_FLAG_ISOLATED_MSI on
the parent IR domain will cause all struct devices attached to it to
return true from msi_device_has_isolated_msi(). This replaces the
IOMMU_CAP_INTR_REMAP flag, as all places using IOMMU_CAP_INTR_REMAP also
call msi_device_has_isolated_msi().
Set the flag and delete the cap.
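As a sketch (where exactly this happens during IR domain creation is
illustrative):

	ir_domain->flags |= IRQ_DOMAIN_FLAG_ISOLATED_MSI;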
Link: https://lore.kernel.org/r/7-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Trivially use the new API.
Link: https://lore.kernel.org/r/4-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Compute the isolated_msi over all the devices in the IOMMU group because
iommufd and vfio both need to know that the entire group is isolated
before granting access to it.
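A sketch of the group-wide check (the iteration over the group's device
list is illustrative of iommu_group internals):

	struct group_device *gdev;
	bool isolated = true;

	mutex_lock(&group->mutex);
	list_for_each_entry(gdev, &group->devices, list)
		isolated &= msi_device_has_isolated_msi(gdev->dev);
	mutex_unlock(&group->mutex);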
Link: https://lore.kernel.org/r/2-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Merge tag 'iommu-updates-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core code:
- map/unmap_pages() cleanup
- SVA and IOPF refactoring
- Clean up and document return codes from device/domain attachment
AMD driver:
- Rework and extend parsing code for ivrs_ioapic, ivrs_hpet and
ivrs_acpihid command line options
- Some smaller cleanups
Intel driver:
- Blocking domain support
- Cleanups
S390 driver:
- Fixes and improvements for attach and aperture handling
PAMU driver:
- Resource leak fix and cleanup
Rockchip driver:
- Page table permission bit fix
Mediatek driver:
- Improve safety from invalid dts input
- Smaller fixes and improvements
Exynos driver:
- Fix driver initialization sequence
Sun50i driver:
- Remove IOMMU_DOMAIN_IDENTITY as it has not been working forever
- Various other fixes"
* tag 'iommu-updates-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (74 commits)
iommu/mediatek: Fix forever loop in error handling
iommu/mediatek: Fix crash on isr after kexec()
iommu/sun50i: Remove IOMMU_DOMAIN_IDENTITY
iommu/amd: Fix typo in macro parameter name
iommu/mediatek: Remove unused "mapping" member from mtk_iommu_data
iommu/mediatek: Improve safety for mediatek,smi property in larb nodes
iommu/mediatek: Validate number of phandles associated with "mediatek,larbs"
iommu/mediatek: Add error path for loop of mm_dts_parse
iommu/mediatek: Use component_match_add
iommu/mediatek: Add platform_device_put for recovering the device refcnt
iommu/fsl_pamu: Fix resource leak in fsl_pamu_probe()
iommu/vt-d: Use real field for indication of first level
iommu/vt-d: Remove unnecessary domain_context_mapped()
iommu/vt-d: Rename domain_add_dev_info()
iommu/vt-d: Rename iommu_disable_dev_iotlb()
iommu/vt-d: Add blocking domain support
iommu/vt-d: Add device_block_translation() helper
iommu/vt-d: Allocate pasid table in device probe path
iommu/amd: Check return value of mmu_notifier_register()
iommu/amd: Fix pci device refcount leak in ppr_notifier()
...
Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
"New Feature:
- Randomize the per-cpu entry areas
Cleanups:
- Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it
- Move to "native" set_memory_rox() helper
- Clean up pmd_get_atomic() and i386-PAE
- Remove some unused page table size macros"
* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/mm: Ensure forced page table splitting
x86/kasan: Populate shadow for shared chunk of the CPU entry area
x86/kasan: Add helpers to align shadow addresses up and down
x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
x86/mm: Recompute physical address for every page of per-CPU CEA mapping
x86/mm: Rename __change_page_attr_set_clr(.checkalias)
x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
x86/mm: Add a few comments
x86/mm: Fix CR3_ADDR_MASK
x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
mm: Convert __HAVE_ARCH_P..P_GET to the new style
mm: Remove pointless barrier() after pmdp_get_lockless()
x86/mm/pae: Get rid of set_64bit()
x86_64: Remove pointless set_64bit() usage
x86/mm/pae: Be consistent with pXXp_get_and_clear()
x86/mm/pae: Use WRITE_ONCE()
x86/mm/pae: Don't (ab)use atomic64
mm/gup: Fix the lockless PMD access
...
The use of set_64bit() in X86_64 only code is pretty pointless, seeing
how it's a direct assignment. Remove all this nonsense.
[nathanchance: unbreak irte]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114425.168036718%40infradead.org
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd implementation from Jason Gunthorpe:
"iommufd is the user API to control the IOMMU subsystem as it relates
to managing IO page tables that point at user space memory.
It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container) which is the VFIO specific interface for a similar idea.
We see a broad need for extended features, some being highly IOMMU
device specific:
- Binding iommu_domain's to PASID/SSID
- Userspace IO page tables, for ARM, x86 and S390
- Kernel bypassed invalidation of user page tables
- Re-use of the KVM page table in the IOMMU
- Dirty page tracking in the IOMMU
- Runtime Increase/Decrease of IOPTE size
- PRI support with faults resolved in userspace
Many of these HW features exist to support VM use cases - for instance
the combination of PASID, PRI and Userspace IO Page Tables allows an
implementation of DMA Shared Virtual Addressing (vSVA) within a guest.
Dirty tracking enables VM live migration with SRIOV devices and PASID
support allows creating "scalable IOV" devices, among other things.
As these features are fundamental to a VM platform they need to be
uniformly exposed to all the driver families that do DMA into VMs,
which is currently VFIO and VDPA"
For more background, see the extended explanations in Jason's pull request:
https://lore.kernel.org/lkml/Y5dzTU8dlmXTbzoJ@nvidia.com/
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (62 commits)
iommufd: Change the order of MSI setup
iommufd: Improve a few unclear bits of code
iommufd: Fix comment typos
vfio: Move vfio group specific code into group.c
vfio: Refactor dma APIs for emulated devices
vfio: Wrap vfio group module init/clean code into helpers
vfio: Refactor vfio_device open and close
vfio: Make vfio_device_open() truly device specific
vfio: Swap order of vfio_device_container_register() and open_device()
vfio: Set device->group in helper function
vfio: Create wrappers for group register/unregister
vfio: Move the sanity check of the group to vfio_create_group()
vfio: Simplify vfio_create_group()
iommufd: Allow iommufd to supply /dev/vfio/vfio
vfio: Make vfio_container optionally compiled
vfio: Move container related MODULE_ALIAS statements into container.c
vfio-iommufd: Support iommufd for emulated VFIO devices
vfio-iommufd: Support iommufd for physical VFIO devices
vfio-iommufd: Allow iommufd to be used in place of a container fd
vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent()
...
Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt core and driver subsystem:
The bulk is the rework of the MSI subsystem to support per device MSI
interrupt domains. This solves conceptual problems of the current
PCI/MSI design which are in the way of providing support for
PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.
IMS (Interrupt Message Store) is a new specification which allows
device manufacturers to provide implementation-defined storage for MSI
messages (as opposed to PCI/MSI and PCI/MSI-X, which have a specified
message store that is uniform across all devices). The PCI/MSI[-X]
uniformity allowed us to get away with "global" PCI/MSI domains.
IMS not only allows overcoming the size limitations of the MSI-X
table, but also gives the device manufacturer the freedom to store the
message in arbitrary places, even in host memory which is shared with
the device.
There have been several attempts to glue this into the current MSI
code, but after lengthy discussions it turned out that there is a
fundamental design problem in the current PCI/MSI-X implementation.
This needs some historical background.
When PCI/MSI[-X] support was added around 2003, interrupt management
was completely different from what we have today in the actively
developed architectures. Interrupt management was completely
architecture specific and while there were attempts to create common
infrastructure the commonalities were rudimentary and just providing
shared data structures and interfaces so that drivers could be written
in an architecture agnostic way.
The initial PCI/MSI[-X] support obviously plugged into this model
which resulted in some basic shared infrastructure in the PCI core
code for setting up MSI descriptors, which are a pure software
construct for holding data relevant for a particular MSI interrupt,
but the actual association to Linux interrupts was completely
architecture specific. This model is still supported today to keep
museum architectures and notorious stragglers alive.
In 2013 Intel tried to add support for hot-pluggable IO/APICs to the
kernel, which was creating yet another architecture specific mechanism
and resulted in an unholy mess on top of the existing horrors of x86
interrupt handling. The x86 interrupt management code was already an
incomprehensible maze of indirections between the CPU vector
management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X]
implementation.
At roughly the same time ARM struggled with the ever growing SoC
specific extensions which were glued on top of the architected GIC
interrupt controller.
This resulted in a fundamental redesign of interrupt management and
provided the now-prevailing concept of hierarchical interrupt
domains. This made it possible to disentangle the interactions between
the x86 vector domain and interrupt remapping, and also allowed ARM to
handle the zoo of SoC-specific interrupt components in a sane way.
The concept of hierarchical interrupt domains aims to encapsulate the
functionality of particular IP blocks which are involved in interrupt
delivery so that they become extensible and pluggable. The X86
encapsulation looks like this:
                                        |--- device 1
    [Vector]---[Remapping]---[PCI/MSI]--|...
                                        |--- device N
where the remapping domain is an optional component and in case that
it is not available the PCI/MSI[-X] domains have the vector domain as
their parent. This reduced the required interaction between the
domains pretty much to the initialization phase where it is obviously
required to establish the proper parent relationship in the
components of the hierarchy.
While in most cases the model is strictly representing the chain of IP
blocks and abstracting them so they can be plugged together to form a
hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the
hardware it's clear that the actual PCI/MSI[-X] interrupt controller
is not a global entity, but strictly a per-PCI-device entity.
Here we took a short cut on the hierarchical model and went for the
easy solution of providing "global" PCI/MSI domains which was possible
because the PCI/MSI[-X] handling is uniform across the devices. This
also made it possible to keep the existing PCI/MSI[-X] infrastructure
mostly unchanged, which in turn made it simple to keep the existing
architecture-specific management alive.
A similar problem was created in the ARM world with support for IP
block specific message storage. Instead of going all the way to stack
an IP block specific domain on top of the generic MSI domain this ended
in a construct which provides a "global" platform MSI domain which
allows overriding the irq_write_msi_msg() callback per allocation.
In course of the lengthy discussions we identified other abuse of the
MSI infrastructure in wireless drivers, NTB etc. where support for
implementation specific message storage was just mindlessly glued into
the existing infrastructure. Some of this just works by chance on
particular platforms but will fail in hard to diagnose ways when the
driver is used on platforms where the underlying MSI interrupt
management code does not expect the creative abuse.
Another shortcoming of today's PCI/MSI-X support is the inability to
allocate or free individual vectors after the initial enablement of
MSI-X. This results in a works-by-chance implementation of VFIO (PCI
pass-through) where interrupts on the host side are not set up upfront
to avoid resource exhaustion. They are expanded at run-time when the
guest actually tries to use them. The way this is implemented is that
the host disables MSI-X and then re-enables it with a larger number of
vectors again. That works by chance because most device drivers set up
all interrupts before the device actually will utilize them. But that's
not universally true, because some drivers allocate a large enough
number of vectors but do not utilize them until it's actually required,
e.g. for acceleration support. At that point other interrupts of the
device might be in active use and the MSI-X disable/enable dance can
result in lost interrupts and therefore subtle, hard-to-diagnose
problems.
Last but not least, the "global" PCI/MSI-X domain approach prevents
utilizing PCI/MSI[-X] and PCI/IMS on the same device due to the fact
that IMS no longer provides a uniform storage and configuration
model.
The solution to this is to implement the missing step and switch from
global PCI/MSI domains to per device PCI/MSI domains. The resulting
hierarchy then looks like this:
                             |--- [PCI/MSI] device 1
    [Vector]---[Remapping]---|...
                             |--- [PCI/MSI] device N
which in turn allows providing support for multiple domains per
device:
                             |--- [PCI/MSI] device 1
                             |--- [PCI/IMS] device 1
    [Vector]---[Remapping]---|...
                             |--- [PCI/MSI] device N
                             |--- [PCI/IMS] device N
This work converts the MSI and PCI/MSI core and the x86 interrupt
domains to the new model, provides new interfaces for post-enable
allocation/free of MSI-X interrupts and the base framework for
PCI/IMS. PCI/IMS has been verified with the work in progress IDXD
driver.
There is work in progress to convert ARM over which will replace the
platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
"solutions" are in the works as well.
Drivers:
- Updates for the LoongArch interrupt chip drivers
- Support for MTK CIRQv2
- The usual small fixes and updates all over the place"
* tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits)
irqchip/ti-sci-inta: Fix kernel doc
irqchip/gic-v2m: Mark a few functions __init
irqchip/gic-v2m: Include arm-gic-common.h
irqchip/irq-mvebu-icu: Fix works by chance pointer assignment
iommu/amd: Enable PCI/IMS
iommu/vt-d: Enable PCI/IMS
x86/apic/msi: Enable PCI/IMS
PCI/MSI: Provide pci_ims_alloc/free_irq()
PCI/MSI: Provide IMS (Interrupt Message Store) support
genirq/msi: Provide constants for PCI/IMS support
x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
PCI/MSI: Provide prepare_desc() MSI domain op
PCI/MSI: Split MSI-X descriptor setup
genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
genirq/msi: Provide msi_domain_alloc_irq_at()
genirq/msi: Provide msi_domain_ops:: Prepare_desc()
genirq/msi: Provide msi_desc:: Msi_data
genirq/msi: Provide struct msi_map
x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
...
Merge tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Drop unregister syscore from hyperv_cleanup to avoid hang (Gaurav
Kohli)
- Clean up panic path for Hyper-V framebuffer (Guilherme G. Piccoli)
- Allow IRQ remapping to work without x2apic (Nuno Das Neves)
- Fix comments (Olaf Hering)
- Expand hv_vp_assist_page definition (Saurabh Sengar)
- Improvement to page reporting (Shradha Gupta)
- Make sure TSC clocksource works when Linux runs as the root partition
(Stanislav Kinsburskiy)
* tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Remove unregister syscore call from Hyper-V cleanup
iommu/hyper-v: Allow hyperv irq remapping without x2apic
clocksource: hyper-v: Add TSC page support for root partition
clocksource: hyper-v: Use TSC PFN getter to map vvar page
clocksource: hyper-v: Introduce TSC PFN getter
clocksource: hyper-v: Introduce a pointer to TSC page
x86/hyperv: Expand definition of struct hv_vp_assist_page
PCI: hv: update comment in x86 specific hv_arch_irq_unmask
hv: fix comment typo in vmbus_channel/low_latency
drivers: hv, hyperv_fb: Untangle and refactor Hyper-V panic notifiers
video: hyperv_fb: Avoid taking busy spinlock on panic path
hv_balloon: Add support for configurable order free page reporting
mm/page_reporting: Add checks for page_reporting_order param
There is a typo so this loop does i++ where i-- was intended. It will
result in looping until the kernel crashes.
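A sketch of the fix (the array name in the loop body is hypothetical):

	while (i--)			/* was "i++", which never terminates */
		platform_device_put(larbdev[i]);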
Fixes: 2659392856 ("iommu/mediatek: Add error path for loop of mm_dts_parse")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Link: https://lore.kernel.org/r/Y5C3mTam2nkbaz6o@kili
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Eric points out this is wrong for the rare case of someone using
allow_unsafe_interrupts on ARM. We always have to setup the MSI window in
the domain if the iommu driver asks for it.
Move the iommu_get_msi_cookie() setup to the top of the function and
always do it, regardless of the security mode. Add checks to
iommufd_device_setup_msi() to ensure the driver is not doing something
incomprehensible. No current driver will set both a HW and SW MSI window,
or have more than one SW MSI window.
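Roughly (a sketch; the field names follow the description above):

	/* always establish the sw-MSI cookie when the driver reported one */
	if (sw_msi_start != PHYS_ADDR_MAX && !hwpt->msi_cookie) {
		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
		if (rc)
			return rc;
		hwpt->msi_cookie = true;
	}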
Fixes: e8d5721003 ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/3-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Correct a few items noticed late in review:
- We should assert that the math in batch_clear_carry() doesn't underflow
- user->locked should be -1 not 0 since we just did mmput
- npages should not have been recalculated, it already has that value
No functional change.
Fixes: 8d160cd4d5 ("iommufd: Algorithms for PFN storage")
Link: https://lore.kernel.org/r/2-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Repair some typos in comments that were noticed late in the review
cycle.
Fixes: f394576eb1 ("iommufd: PFN handling for iopt_pages")
Link: https://lore.kernel.org/r/1-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
PCI/IMS works like PCI/MSI-X in the remapping. Just add the feature flag,
but only when on real hardware.
Virtualized IOMMUs need additional support.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232327.140571546@linutronix.de
PCI/IMS works like PCI/MSI-X in the remapping. Just add the feature flag,
but only when on real hardware.
Virtualized IOMMUs need additional support, e.g. for PASID.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232327.081482253@linutronix.de
Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and setup per
device domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.209212272@linutronix.de
Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and setup per
device domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.151226317@linutronix.de
Enable MSI parent domain support in the x86 vector domain and fixup the
checks in the iommu implementations to check whether device::msi::domain is
the default MSI parent domain. That keeps the existing logic to protect
e.g. devices behind VMD working.
The interrupt remap PCI/MSI code still works because the underlying vector
domain still provides the same functionality.
None of the other x86 PCI/MSI implementations, e.g. XEN and HyperV, are
affected either. They still work the same way, both at the low level and
in the PCI/MSI implementations they provide.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.034672592@linutronix.de
The impacted QAT device IDs that need the extra dtlb flush quirk range
from 0x4940 to 0x4943. After bitwise ANDing the device ID with 0xfffc,
the result should be 0x4940 instead of 0x494c to identify these devices.
Fixes: e65a6897be ("iommu/vt-d: Add a fix for devices need extra dtlb flush")
Reported-by: Raghunathan Srinivasan <raghunathan.srinivasan@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lore.kernel.org/r/20221203005610.2927487-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
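A minimal sketch of the corrected check; the helper name is hypothetical,
only the mask and constants come from the commit text:

    /* 0x4940..0x4943 all collapse to 0x4940 under the 0xfffc mask */
    static bool qat_needs_extra_dtlb_flush(u16 device_id)
    {
            return (device_id & 0xfffc) == 0x4940;
    }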
If the system is rebooted via isr(), the IRQ handler might be triggered
before the domain is initialized, resulting in an invalid memory access
error.
Fix:
[ 0.500930] Unable to handle kernel read from unreadable memory at virtual address 0000000000000070
[ 0.501166] Call trace:
[ 0.501174] report_iommu_fault+0x28/0xfc
[ 0.501180] mtk_iommu_isr+0x10c/0x1c0
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20221125-mtk-iommu-v2-0-e168dff7d43e@chromium.org
[ joro: Fixed spelling in commit message ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
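One possible shape of such a guard, as a hedged sketch; the m4u_dom field
name follows the mediatek driver, but the actual fix may be structured
differently:

    /* Bail out early if the IRQ fires before the domain exists */
    if (!data->m4u_dom)
            return IRQ_NONE;
    report_iommu_fault(&data->m4u_dom->domain, data->dev, fault_iova, 0);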
This driver treats IOMMU_DOMAIN_IDENTITY the same as UNMANAGED, which
cannot possibly be correct.
UNMANAGED domains are required to start out blocking all DMAs. This seems
to be what this driver does as it allocates a first level 'dt' for the IO
page table that is 0 filled.
Thus UNMANAGED looks like a working IO page table, and so IDENTITY must be
a mistake. Remove it.
Fixes: 4100b8c229 ("iommu: Add Allwinner H6 IOMMU driver")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/0-v1-97f0adf27b5e+1f0-s50_identity_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Just remove an unused variable that is only used by mtk_iommu_v1.
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-7-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
No functional change; this just improves robustness against a malformed
dts. All the larbs that connect to one IOMMU must connect to the same
smi-common. This patch checks the mediatek,smi property of each larb;
if their mediatek,smi properties differ, it returns an error. It also
handles the case where no smi-larb nodes are available.
Suggested-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-6-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Fix the smatch warnings:
drivers/iommu/mtk_iommu.c:878 mtk_iommu_mm_dts_parse() error: uninitialized
symbol 'larbnode'.
If someone abuses the dtsi node (i.e. does not follow the dt-binding
definition), for example by providing "mediatek,larbs" as a boolean
property, "larb_nr" will be zero and cause abnormal behavior.
To fix this problem and improve code safety, add checks for invalid
input from the dtsi, e.g. checking that larb_nr/larbid are within the
valid range, and avoid "mediatek,larb-id" property conflicts in the
smi-larb nodes.
Fixes: d2e9a1102c ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-5-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
mtk_iommu_mm_dts_parse() parses the smi larb nodes. If parsing the
(i+1)-th larb fails, we should call put_device() for larbs i..0.
Two points are worth noting:
1) The larbid may not be a linear mapping, so we should loop over the
whole array in the error path.
2) The line "data->larb_imu[id].dev = &plarbdev->dev;" is moved before
"if (!plarbdev->dev.driver)", which means data->larb_imu[id].dev is set
before entering the error path. Then we don't need
"platform_device_put(plarbdev)" again in the probe_defer case; all
cleanup relies on put_device() in the error path.
Fixes: d2e9a1102c ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-4-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In order to simplify the error path (avoiding calls to of_node_put),
use component_match_add instead of component_match_add_release, since we
are only interested in the "device" here. Then we can always call
of_node_put in the normal path.
Strictly speaking this is not a fixes patch, but it prepares for adding
the error path, thus a Fixes tag is added too.
Fixes: d2e9a1102c ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-3-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add platform_device_put() to match of_find_device_by_node().
Meanwhile, add a new variable "pcommdev" for the smi-common device;
otherwise "platform_device_put(plarbdev)" for the smi-common device
would not read clearly. Also add a check for whether pcommdev is NULL.
Fixes: d2e9a1102c ("iommu/mediatek: Contain MM IOMMU flow with the MM TYPE")
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Link: https://lore.kernel.org/r/20221018024258.19073-2-yong.wu@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommufd version.
The rc fix was done a different way when iommufd patches reworked this
code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the VFIO container is compiled out, give a kconfig option for iommufd
to provide the miscdev node with the same name and permissions as vfio
uses.
The compatibility node supports the same ioctls as VFIO and automatically
enables the VFIO compatible pinned page accounting mode.
Link: https://lore.kernel.org/r/10-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yu He <yu.he@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
for_each_pci_dev() is implemented by pci_get_device(). The comment of
pci_get_device() says that it will increase the reference count for the
returned pci_dev and also decrease the reference count for the input
pci_dev @from if it is not NULL.
If we break out of the for_each_pci_dev() loop while pdev is not NULL,
we need to call pci_dev_put() to decrease the reference count. Add the
missing pci_dev_put() in the error path to avoid a reference count leak.
Fixes: 2e45528930 ("iommu/vt-d: Unify the way to process DMAR device scope array")
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Link: https://lore.kernel.org/r/20221121113649.190393-3-wangxiongfeng2@huawei.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
for_each_pci_dev() is implemented by pci_get_device(). The comment of
pci_get_device() says that it will increase the reference count for the
returned pci_dev and also decrease the reference count for the input
pci_dev @from if it is not NULL.
If we break out of the for_each_pci_dev() loop while pdev is not NULL,
we need to call pci_dev_put() to decrease the reference count. Add the
missing pci_dev_put() before 'return true' to avoid a reference count
leak.
Fixes: 89a6079df7 ("iommu/vt-d: Force IOMMU on for platform opt in hint")
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Link: https://lore.kernel.org/r/20221121113649.190393-2-wangxiongfeng2@huawei.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
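Both this and the previous fix follow the same pattern; a hedged sketch
where the match() predicate is hypothetical:

    static bool has_matching_dev(void)
    {
            struct pci_dev *pdev = NULL;

            for_each_pci_dev(pdev) {
                    if (match(pdev)) {
                            /* drop the ref pci_get_device() took for us */
                            pci_dev_put(pdev);
                            return true;
                    }
            }
            /* loop exited with pdev == NULL, so no reference is held */
            return false;
    }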
As the comment of pci_get_domain_bus_and_slot() says, it returns a pci
device with its refcount incremented; when finished with it, the caller
must decrease the reference count by calling pci_dev_put(). So call
pci_dev_put() after using the 'pdev' to avoid a refcount leak.
Besides, if the 'pdev' is NULL or intel_svm_prq_report() returns an
error, there is no need to trace this fault.
Fixes: 06f4b8d09d ("iommu/vt-d: Remove unnecessary SVA data accesses in page fault path")
Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20221119144028.2452731-1-yangyingliang@huawei.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in
address translation service (ATS). These devices may inadvertently issue
ATS invalidation completion before posted writes initiated with
translated address that utilized translations matching the invalidation
address range, violating the invalidation completion ordering.
This patch adds an extra device TLB invalidation for the affected
devices. It is needed to ensure that no more posted writes with
translated addresses follow the invalidation completion. Therefore, the
ordering is preserved and data corruption is prevented.
Device TLBs are invalidated under the following six conditions:
1. Device driver does DMA API unmap IOVA
2. Device driver unbinds a PASID from a process, sva_unbind_device()
3. PASID is torn down, after PASID cache is flushed. e.g. process
exit_mmap() due to crash
4. Under SVA usage, called by mmu_notifier.invalidate_range() where
VM has to free pages that were unmapped
5. userspace driver unmaps a DMA buffer
6. Cache invalidation in vSVA usage (upcoming)
For #1 and #2, device drivers are responsible for stopping DMA traffic
before unmap/unbind. For #3, the iommu driver gets an mmu_notifier
callback to invalidate the TLB the same way as a normal user unmap,
which will do an extra invalidation. The dTLB invalidation after the
PASID cache flush does not need an extra invalidation.
Therefore, we only need to deal with #4 and #5 in this patch. #1 is also
covered by this patch due to common code path with #5.
Tested-by: Yuzhang Luo <yuzhang.luo@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lore.kernel.org/r/20221130062449.1360063-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These are on performance paths, so we protect them with
CONFIG_IOMMUFD_TEST to avoid taking a hit during normal operation.
These are useful when running the test suite and syzkaller to find data
structure inconsistencies early.
Link: https://lore.kernel.org/r/18-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Provide a mock kernel module for the iommu_domain that allows it to run
without any HW and the mocking provides a way to directly validate that
the PFNs loaded into the iommu_domain are correct. This exposes the access
kAPI toward userspace to allow userspace to explore the functionality of
pages.c and io_pagetable.c
The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will operate at a 2K iommu page size. This allows exercising all
of the calculations to support this mismatch.
This is also intended to support syzkaller exploring the same space.
However, it is an unusually invasive config option to enable all of
this. The config option should not be enabled in a production kernel.
Link: https://lore.kernel.org/r/16-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390
Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations.
A userspace application can test against iommufd and confirm compatibility
then simply make a small change to open /dev/iommu instead of
/dev/vfio/vfio.
For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and
then all applications will use the compatibility path with no code
changes. A later series allows /dev/vfio/vfio to be directly provided by
iommufd, which allows the rlimit mode to work the same as well.
This series just provides the iommufd side of compatibility. Actually
linking this to VFIO_SET_CONTAINER is a followup series, with a link in
the cover letter.
Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.
Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.
While this is enough to operate qemu, it has a few differences:
- Resource limits rely on memory cgroups to bound what userspace can do
instead of the module parameter dma_entry_limit.
- VFIO P2P is not implemented. The DMABUF patches for vfio are a start at
a solution where iommufd would import a special DMABUF. This is to avoid
further propagating the follow_pfn() security problem.
- A full audit for pedantic compatibility details (eg errnos, etc) has
not yet been done
- powerpc SPAPR is left out, as it is not connected to the iommu_domain
framework. It seems interest in SPAPR is minimal as it is currently
non-working in v6.1-rc1. They will have to convert to the iommu
subsystem framework to enjoy iommufd.
The following are not going to be implemented and we expect to remove them
from VFIO type1:
- SW access 'dirty tracking'. As discussed in the cover letter this will
be done in VFIO.
- VFIO_TYPE1_NESTING_IOMMU
https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
- VFIO_DMA_MAP_FLAG_VADDR
https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/
Link: https://lore.kernel.org/r/15-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
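A hedged userspace sketch of querying the compat IOAS; the struct layout
is assumed from the uAPI header, so check include/uapi/linux/iommufd.h for
the authoritative definition:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    struct iommu_vfio_ioas cmd = {
            .size = sizeof(cmd),
            .op = IOMMU_VFIO_IOAS_GET,
    };

    if (ioctl(iommufd, IOMMU_VFIO_IOAS, &cmd) == 0)
            printf("compat IOAS id: %u\n", cmd.ioas_id);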
Kernel access is the mode that VFIO "mdevs" use. In this case there is no
struct device and no IOMMU connection. iommufd acts as a record keeper for
accesses and returns the actual struct pages back to the caller to use
however they need, e.g. with kmap or the DMA API.
Each caller must create a struct iommufd_access with
iommufd_access_create(), similar to how iommufd_device_bind() works. Using
this struct the caller can access blocks of IOVA using
iommufd_access_pin_pages() or iommufd_access_rw().
Callers must provide a callback that immediately unpins any IOVA being
used within a range. This happens if userspace unmaps the IOVA under the
pin.
The implementation forwards the access requests directly to the iopt
infrastructure that manages the iopt_pages_access.
Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
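A hedged sketch of the kernel-access flow; the function names come from
the text above but the argument lists are assumptions:

    /* Register an access object, then pin a window of IOVA */
    access = iommufd_access_create(ictx, ioas_id, &my_access_ops, my_data);
    if (IS_ERR(access))
            return PTR_ERR(access);

    /* pages[] receives the struct page pointers backing the IOVA range */
    rc = iommufd_access_pin_pages(access, iova, length, pages, 0);

    /* my_access_ops.unmap() must immediately unpin any overlapping IOVA */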
Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:
iommufd_device_bind() / iommufd_device_unbind()
Register the device with iommufd and establish security isolation.
iommufd_device_attach() / iommufd_device_detach()
Connect a bound device to a page table
Binding a device creates a device object ID in the uAPI, however the
generic API does not yet provide any IOCTLs to manipulate them.
Link: https://lore.kernel.org/r/13-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
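A hedged usage sketch of the four calls; the argument lists are
assumptions, only the function names come from the text:

    idev = iommufd_device_bind(ictx, dev, &dev_id);
    if (IS_ERR(idev))
            return PTR_ERR(idev);

    /* pt_id may name either an IOAS or a hw_pagetable */
    rc = iommufd_device_attach(idev, &pt_id);
    if (rc) {
            iommufd_device_unbind(idev);
            return rc;
    }
    /* ... DMA runs ... */
    iommufd_device_detach(idev);
    iommufd_device_unbind(idev);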
The hw_pagetable object exposes internal struct iommu_domain objects to
userspace. An iommu_domain is required when any DMA device attaches to an
IOAS to control the io page table through the iommu driver.
For compatibility with VFIO the hw_pagetable is automatically created when
a DMA device is attached to the IOAS. If a compatible iommu_domain already
exists then the hw_pagetable associated with it is used for the
attachment.
In the initial series there is no iommufd uAPI for the hw_pagetable
object. The next patch provides driver facing APIs for IO page table
attachment that allows drivers to accept either an IOAS or a hw_pagetable
ID and for the driver to return the hw_pagetable ID that was auto-selected
from an IOAS. The expectation is the driver will provide uAPI through its
own FD for attaching its device to iommufd. This allows userspace to learn
the mapping of devices to iommu_domains and to override the automatic
attachment.
The future HW specific interface will allow userspace to create
hw_pagetable objects using iommu_domains with IOMMU driver specific
parameters. This infrastructure will allow linking those domains to IOAS's
and devices.
Link: https://lore.kernel.org/r/12-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.
This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.
It is also designed to be easy to use for simple non virtual machine
monitor users, like DPDK:
- Universal simple support for all IOMMUs (no PPC special path)
- An IOVA allocator that considers the aperture and the allowed/reserved
ranges
- io_pagetable allows any number of iommu_domains to be connected to the
IOAS
- Automatic allocation and re-use of iommu_domains
Along with room in the design to add non-generic features to cater to
specific HW functionality.
Link: https://lore.kernel.org/r/11-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
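A hedged userspace sketch of the alloc+map path; the field and flag names
are assumed from the uAPI header, so consult include/uapi/linux/iommufd.h:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
    ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

    struct iommu_ioas_map map = {
            .size = sizeof(map),
            .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
            .ioas_id = alloc.out_ioas_id,
            .user_va = (uintptr_t)buf,
            .length = buf_len,
            .iova = 0,  /* kernel-chosen unless IOMMU_IOAS_MAP_FIXED_IOVA */
    };
    ioctl(iommufd, IOMMU_IOAS_MAP, &map);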
This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.
At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't
be mapped and an IOVA of allowed IOVA that can always be mappable.
The concept of an 'access' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU based access.
Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.
The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping by re-using the iopt_pages. This
is the basic mechanism to provide single pinning.
This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map in memory and then establish an
access to pin all PFNs into the xarray.
Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The iopt_pages represents a logical linear list of full PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and accesses.
The three storage tiers are managed to meet these objectives:
- If no iommu_domain or in-kernel access exists then minimal memory
  should be consumed by iommufd
- If a page has been pinned then an iopt_pages will not pin it again
- If an in-kernel access exists then the xarray must provide the backing
storage to avoid allocations on domain removals
- Otherwise any iommu_domain will be used for storage
In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.
The external interface for pages has several logical operations:
iopt_area_fill_domain() will load the PFNs from storage into a single
domain. This is used when attaching a new domain to an existing IOAS.
iopt_area_fill_domains() will load the PFNs from storage into multiple
domains. This is used when creating a new IOVA map in an existing IOAS
iopt_pages_add_access() creates an iopt_pages_access that tracks an
in-kernel access of PFNs. This is some external driver that might be
accessing the IOVA using the CPU, or programming PFNs with the DMA
API. ie a VFIO mdev.
iopt_pages_rw_access() directly performs a memcpy on the PFNs, without
the overhead of iopt_pages_add_access()
iopt_pages_fill_xarray() will load PFNs into the xarray and return a
'struct page *' array. It is used by iopt_pages_accesses to extract PFNs
for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
is known the xarray is already filled.
As an iopt_pages can be referred to in slices by many areas and accesses
it uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to target domain/array.
Unfill actions are similar: on a page-by-page basis domains are unmapped,
xarray entries freed or struct pages fully put back.
Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs
always generates the largest contiguous PFN range possible to optimize and
this gathering can cross storage tier boundaries. For cases like 'fill
domains' care is taken to avoid duplicated work and PFNs are read once and
pushed into all domains.
The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.
This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfns lists instead of working with one driver call per page
to read, and one driver call per contiguous range to store.
Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables)
and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to
access the PFNs that those IOVA areas cover.
The IO Address Space (IOAS) datastructure is composed of:
- struct io_pagetable holding the IOVA map
- struct iopt_areas representing populated portions of IOVA
- struct iopt_pages representing the storage of PFNs
- struct iommu_domain representing each IO page table in the system IOMMU
- struct iopt_pages_access representing in-kernel accesses of PFNs (ie
VFIO mdevs)
- struct xarray pinned_pfns holding a list of pages pinned by in-kernel
accesses
This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
1) iopt_pages::pinned_pfns xarray
2) Multiple iommu_domains
3) The origin of the PFNs, i.e. the userspace pointer
PFNs have to be copied between all combinations of tiers, depending on
the configuration.
The interface is an iterator called a 'pfn_reader' which determines in
which tier each PFN is stored and loads it into a list of PFNs held in a
struct pfn_batch.
Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.
The pfn_reader and pfn_batch also keep track of the pinned page accounting.
While PFNs are always stored and accessed as full PAGE_SIZE units the
iommu_domain tier can store with a sub-page offset/length to support
IOMMUs with a smaller IOPTE size than PAGE_SIZE.
Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.
It provides:
- A miscdevice to create file descriptors to run the IOCTL interface over
- A table based ioctl dispatch and centralized extendable pre-validation
step
- An xarray mapping userspace IDs to kernel objects. The design has
multiple inter-related objects held within a single IOMMUFD fd
- A simple usage count to build a graph of object relations and protect
against hostile userspace racing ioctls
The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
Link: https://lore.kernel.org/r/6-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
These complement the group interfaces used by VFIO and are for use by
iommufd. The main difference is that multiple devices in the same group
can all share the ownership by passing the same ownership pointer.
Move the common code into shared functions.
Link: https://lore.kernel.org/r/2-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This queries if a domain linked to a device should expect to support
enforce_cache_coherency() so iommufd can negotiate the rules for when a
domain should be shared or not.
For iommufd a device that declares IOMMU_CAP_ENFORCE_CACHE_COHERENCY will
not be attached to a domain that does not support it.
Link: https://lore.kernel.org/r/1-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yu He <yu.he@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
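For illustration, a hedged sketch of how a caller can query this;
device_iommu_capable() is the existing kAPI, the surrounding policy is an
assumption:

    /* Such a device must only share domains that enforce coherency */
    if (device_iommu_capable(dev, IOMMU_CAP_ENFORCE_CACHE_COHERENCY))
            use_enforcing_domain = true;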
If x2apic is not available, hyperv-iommu skips remapping irqs. This
breaks the root partition, which always needs irqs remapped.
Fix this by allowing irq remapping regardless of x2apic,
and change hyperv_enable_irq_remapping() to return
IRQ_REMAP_XAPIC_MODE in case x2apic is missing.
Tested with root and non-root hyperv partitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1668715899-8971-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
fsl_pamu_probe() returns directly when create_csd() fails, leaving the
irq and memory allocations unreleased.
Fix this by jumping to the error path if create_csd() returns an error.
Fixes: 695093e38c ("iommu/fsl: Freescale PAMU driver and iommu implementation.")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Link: https://lore.kernel.org/r/20221121082022.19091-1-yuancan@huawei.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The dmar_domain uses bit field members to indicate the behaviors. Add
a bit field for using first level and remove the flags member to avoid
duplication.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The device_domain_info::domain accurately records the domain attached to
the device. It is unnecessary to check whether the context is present in
the attach_dev path. Remove it to make the code neat.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
dmar_domain_attach_device() is a more meaningful name given what this
helper does.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Rename iommu_disable_dev_iotlb() to iommu_disable_pci_caps() to pair with
iommu_enable_pci_caps().
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Intel IOMMU hardware supports blocking DMA transactions by clearing
the translation table entries. This implements a real blocking domain to
avoid using an empty UNMANAGED domain. The detach_dev callback of the
domain ops is not used in any path. Remove it to avoid dead code as well.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If domain attaching to device fails, the IOMMU driver should bring the
device to blocking DMA state. The upper layer is expected to recover it
by attaching a new domain. Use device_block_translation() in the error
path of dev_attach to make the behavior specific.
The difference between device_block_translation() and the previous
dmar_remove_one_dev_info() is that, in the scalable mode, it is the
RID2PASID entry instead of context entry being cleared. As a result,
enabling PCI capabilities is moved up.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Whether or not a domain is attached to the device, the pasid table should
always be valid as long as it has been probed. This moves the pasid table
allocation from the domain attaching device path to device probe path and
frees it in the device release path.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221118132451.114406-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Merge tag 'arm-smmu-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
Arm SMMU updates for 6.2
- Report a warning if we fail to disable the MMU-500 prefetcher
- Usual mass of devicetree binding additions
- Qualcomm SMMU refactoring and generic "qcom,smmu-500" addition
SMMUv2 DT binding additions, including a generic Qualcomm compatible
string ("qcom,smmu-500") which will hopefully spell the end for
pointless SoC-specific additions in future.
* for-joerg/arm-smmu/bindings:
iommu/arm-smmu-qcom: Add SM6350 SMMUv2
dt-bindings: arm-smmu: Add SM6350 GPU SMMUv2
iommu/arm-smmu-qcom: Add generic qcom,smmu-500 match entry
iommu/arm-smmu-qcom: Stop using mmu500 reset for v2 MMUs
iommu/arm-smmu-qcom: Merge table from arm-smmu-qcom-debug into match data
iommu/arm-smmu-qcom: provide separate implementation for SDM845-smmu-500
iommu/arm-smmu-qcom: Move the qcom,adreno-smmu check into qcom_smmu_create
iommu/arm-smmu-qcom: Move implementation data into match data
dt-bindings: arm-smmu: Add generic qcom,smmu-500 bindings
dt-bindings: arm-smmu: add special case for Google Cheza platform
dt-bindings: arm-smmu: fix clocks/clock-names schema
dt-bindings: arm-smmu: Add missing Qualcomm SMMU compatibles
dt-bindings: iommu: arm-smmu: add sdm670 compatible
iommu/arm-smmu-qcom: Add SM6115 support
dt-bindings: arm-smmu: Add compatible for Qualcomm SM6115
drivers: arm-smmu-impl: Add QDU1000 and QRU1000 iommu implementation
dt-bindings: arm-smmu: Add 'compatible' for QDU1000 and QRU1000
DMA allocations can never be turned back into a page pointer, so
requesting compound pages doesn't make sense and it can't even be
supported at all by various backends.
Reject __GFP_COMP with a warning in dma_alloc_attrs, and stop clearing
the flag in the arm dma ops and dma-iommu.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
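A hedged sketch of the guard; close to, but not necessarily identical
with, the actual diff:

    /* At the top of dma_alloc_attrs(): a DMA allocation can never be
     * turned back into a page pointer, so compound pages are rejected.
     */
    if (WARN_ON_ONCE(flag & __GFP_COMP))
            return NULL;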
Added a return value check for the function mmu_notifier_register().
The return value of mmu_notifier_register(), called at iommu_v2.c:642,
was not checked, although it is usually checked for this function.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Link: https://lore.kernel.org/r/20221118104252.122809-1-arefev@swemel.ru
[joro: Fix commit message ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
As the comment of pci_get_domain_bus_and_slot() says, it returns a pci
device with its refcount incremented; when finished with it, the caller
must decrement the reference count by calling pci_dev_put(). So call it
before returning from ppr_notifier() to avoid a refcount leak.
Fixes: daae2d25a4 ("iommu/amd: Don't copy GCR3 table root pointer")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20221118093604.216371-1-yangyingliang@huawei.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The SRS cap is the hardware capability telling whether the IOMMU can
support requests seeking supervisor privilege. The SRE bit in a
scalable-mode PASID table entry is treated as Reserved(0) by
implementations not supporting the SRS cap.
Checking the SRS cap before setting the SRE bit avoids the
non-recoverable fault of "Non-zero reserved field set in PASID Table
Entry" caused by setting the SRE bit while there is no SRS cap support.
The fault messages look like below:
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [00:0d.0] fault addr 0x1154e1000
[fault reason 0x5a]
SM: Non-zero reserved field set in PASID Table Entry
Fixes: 6f7db75e1c ("iommu/vt-d: Add second level page table interface")
Cc: stable@vger.kernel.org
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20221115070346.1112273-1-tina.zhang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20221116051544.26540-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
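A hedged sketch of the check; ecap_srs() and pasid_set_sre() follow the
VT-d driver's naming, but the error handling shown is an assumption:

    /* Refuse to set SRE when the IOMMU lacks the SRS capability */
    if (!ecap_srs(iommu->ecap))
            return -EINVAL;
    pasid_set_sre(pte);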
The A/D bits are preset for IOVA over first level (FL) usage for both
kernel DMA (i.e., domain type is IOMMU_DOMAIN_DMA) and user space DMA
usage (i.e., domain type is IOMMU_DOMAIN_UNMANAGED).
Presetting the A bit in FL requires presetting the bit in every related
paging entry, including the non-leaf ones. Otherwise, hardware may treat
this as an error. For example, in a case of ECAP_REG.SMPWC==0, DMA
faults might occur with the below DMAR fault messages (wrapped for line
length) dumped.
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [aa:00.0] fault addr 0x10c3a6000
[fault reason 0x90]
SM: A/D bit update needed in first-level entry when set up in no snoop
Fixes: 289b3b005c ("iommu/vt-d: Preset A/D bits for user space DMA usage")
Cc: stable@vger.kernel.org
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20221113010324.1094483-1-tina.zhang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20221116051544.26540-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>