516 Commits

Author SHA1 Message Date
Nishanth Aravamudan
bb0054552d powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernel
When attempting to kdump with the 4.2 kernel, we see for each PCI
device:

 pci 0003:01     : [PE# 000] Assign DMA32 space
 pci 0003:01     : [PE# 000] Setting up 32-bit TCE table at 0..80000000
 pci 0003:01     : [PE# 000] Failed to create 32-bit TCE table, err -22
 PCI: Domain 0004 has 8 available 32-bit DMA segments
 PCI: 4 PE# for a total weight of 70
 pci 0004:01     : [PE# 002] Assign DMA32 space
 pci 0004:01     : [PE# 002] Setting up 32-bit TCE table at 0..80000000
 pci 0004:01     : [PE# 002] Failed to create 32-bit TCE table, err -22
 pci 0004:0d     : [PE# 005] Assign DMA32 space
 pci 0004:0d     : [PE# 005] Setting up 32-bit TCE table at 0..80000000
 pci 0004:0d     : [PE# 005] Failed to create 32-bit TCE table, err -22
 pci 0004:0e     : [PE# 006] Assign DMA32 space
 pci 0004:0e     : [PE# 006] Setting up 32-bit TCE table at 0..80000000
 pci 0004:0e     : [PE# 006] Failed to create 32-bit TCE table, err -22
 pci 0004:10     : [PE# 008] Assign DMA32 space
 pci 0004:10     : [PE# 008] Setting up 32-bit TCE table at 0..80000000
 pci 0004:10     : [PE# 008] Failed to create 32-bit TCE table, err -22

and eventually the kdump kernel fails to boot as none of the PCI devices
(including the disk controller) are successfully initialized.

The EINVAL response is because the DMA window (the 2GB base window) is
larger than the kdump kernel's reserved memory (crashkernel=, in this
case specified to be 1024M). The check in question,

 if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))

is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
the caller to pass in a smaller window size if our maximum memory value
is smaller than the DMA window.

After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.

The problem was seen on a Firestone machine originally.

Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Cc: stable@vger.kernel.org # 4.2
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Coding style pedantry, use u64, change the indentation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-07 20:13:14 +10:00
Linus Torvalds
ff474e8ca8 powerpc updates for 4.3
- Support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask from Benjamin Herrenschmidt
  - EEH fixes for SRIOV from Gavin
  - Introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth
  - Use hardware RNG for arch_get_random_seed_* not arch_get_random_* from Paul Mackerras
  - Seccomp filter support from Michael Ellerman
  - opal_cec_reboot2() handling for HMIs & machine checks from Mahesh Salgaonkar
  - Add powerpc timebase as a trace clock source from Naveen N. Rao
  - Misc cleanups in the xmon, signal & SLB code from Anshuman Khandual
  - Add an inline function to update POWER8 HID0 from Gautham R. Shenoy
  - Fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman
  - Drop support for 64K local store on 4K kernels from Michael Ellerman
  - move dma_get_required_mask() from pnv_phb to pci_controller_ops from Andrew Donnellan
  - Initialize distance lookup table from drconf path from Nikunj A Dadhania
  - Enable RTC class support from Vaibhav Jain
  - Disable automatically blocked PCI config from Gavin Shan
  - Add LEDs driver for PowerNV platform from Vasant Hegde
  - Fix endianness issues in the HVSI driver from Laurent Dufour
  - Kexec endian fixes from Samuel Mendoza-Jonas
  - Fix corrupted pdn list from Gavin Shan
  - Fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan
 
  - Freescale updates from Scott: Highlights include 32-bit memcpy/memset
    optimizations, checksum optimizations, 85xx config fragments and updates,
    device tree updates, e6500 fixes for non-SMP, and misc cleanup and minor
    fixes.
 
  - A ton of cxl updates & fixes:
   - Add explicit precision specifiers from Rasmus Villemoes
   - use more common format specifier from Rasmus Villemoes
   - Destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
   - Destroy afu->contexts_idr on release of an afu from Johannes Thumshirn
   - Compile with -Werror from Daniel Axtens
   - EEH support from Daniel Axtens
   - Plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
   - Add alternate MMIO error handling from Ian Munsie
   - Allow release of contexts which have been OPENED but not STARTED from Andrew Donnellan
   - Remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
   - Release irqs if memory allocation fails from Vaibhav Jain
   - Remove racy attempt to force EEH invocation in reset from Daniel Axtens
   - Fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
   - Fix force unmapping mmaps of contexts allocated through the kernel api from Ian Munsie
   - Set up and enable PSL Timebase from Philippe Bergheaud
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV5+GzAAoJEFHr6jzI4aWA0iAP/jcd0kNaNBzLgcDKKygKdgz4
 xn4EWu81vfMfZYWesb0ATrjlH0hLsRxSXoFUqUMhtJTa5kNAoCIaz/M8WBALS50h
 aT+i7br4WEU2j2FcaMyP3iAZx/2hl+2utODJSHPRWPkec1fUDBfEyBf++e520RWM
 HUQGIGZXh8yq7KMA96Pwhsvls9vOB8hS2UdU/NS8ff3J5jFvXC1/WmF2qfzJBS1V
 8iHyz26Jl8+dJ+et7iC2oD5XQAjIH1oJgOyPVPBzAQttfi8RjuVzRA30TfPBAUwI
 lC9nlmPy6bCe4kiQYWVB1z7GegHyW/9vkeuMj/u8mZbqpaayMEMZmd2C3iNDXNHx
 i2NSvdln539t4qWYsV2v6lVCfa/ayDHD73Wackj5Dk394tzXnpCPhxNzc2yKEd5v
 h7vwYc9jBhsbfSCSogaM+gSHJ1APgCidggHJMYYNA2nN2u6V62RpsMB7zp/1+Q2v
 yqYdD8oYF4Dm21x/ujaNFrlizROD46WS0UqdJ3yP6HAqRYIpRXtibmpECJgt1n5h
 HjADEci4hQ2UQxdMdp/Q5KZnPTJebBtrZrmkW5r6cZBUaTB5TVkFaEWN44CT/Loh
 tMNeA3qOBN06CaQS2WL3UUUWpbZq9fSbWuUZ5lWZDb5AOyRxe5eWVYNLkiyIXozY
 L24l1bYdBhXahnjoS/kc
 =n9+X
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask
   from Benjamin Herrenschmidt

 - EEH fixes for SRIOV from Gavin

 - introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth

 - use hardware RNG for arch_get_random_seed_* not arch_get_random_*
   from Paul Mackerras

 - seccomp filter support from Michael Ellerman

 - opal_cec_reboot2() handling for HMIs & machine checks from Mahesh
   Salgaonkar

 - add powerpc timebase as a trace clock source from Naveen N.  Rao

 - misc cleanups in the xmon, signal & SLB code from Anshuman Khandual

 - add an inline function to update POWER8 HID0 from Gautham R.  Shenoy

 - fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman

 - drop support for 64K local store on 4K kernels from Michael Ellerman

 - move dma_get_required_mask() from pnv_phb to pci_controller_ops from
   Andrew Donnellan

 - initialize distance lookup table from drconf path from Nikunj A
   Dadhania

 - enable RTC class support from Vaibhav Jain

 - disable automatically blocked PCI config from Gavin Shan

 - add LEDs driver for PowerNV platform from Vasant Hegde

 - fix endianness issues in the HVSI driver from Laurent Dufour

 - kexec endian fixes from Samuel Mendoza-Jonas

 - fix corrupted pdn list from Gavin Shan

 - fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan

 - Freescale updates from Scott: Highlights include 32-bit memcpy/memset
   optimizations, checksum optimizations, 85xx config fragments and
   updates, device tree updates, e6500 fixes for non-SMP, and misc
   cleanup and minor fixes.

 - a ton of cxl updates & fixes:
    - add explicit precision specifiers from Rasmus Villemoes
    - use more common format specifier from Rasmus Villemoes
    - destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
    - destroy afu->contexts_idr on release of an afu from Johannes
      Thumshirn
    - compile with -Werror from Daniel Axtens
    - EEH support from Daniel Axtens
    - plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
    - add alternate MMIO error handling from Ian Munsie
    - allow release of contexts which have been OPENED but not STARTED
      from Andrew Donnellan
    - remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
    - release irqs if memory allocation fails from Vaibhav Jain
    - remove racy attempt to force EEH invocation in reset from Daniel
      Axtens
    - fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
    - fix force unmapping mmaps of contexts allocated through the kernel
      api from Ian Munsie
    - set up and enable PSL Timebase from Philippe Bergheaud

* tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (140 commits)
  cxl: Set up and enable PSL Timebase
  cxl: Fix force unmapping mmaps of contexts allocated through the kernel api
  cxl: Fix + cleanup error paths in cxl_dev_context_init
  powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail()
  powerpc/pseries: Cleanup on pci_dn_reconfig_notifier()
  powerpc/pseries: Fix corrupted pdn list
  powerpc/powernv: Enable LEDS support
  powerpc/iommu: Set default DMA offset in dma_dev_setup
  cxl: Remove racy attempt to force EEH invocation in reset
  cxl: Release irqs if memory allocation fails
  cxl: Remove use of macro DEFINE_PCI_DEVICE_TABLE
  powerpc/powernv: Fix mis-merge of OPAL support for LEDS driver
  powerpc/powernv: Reset HILE before kexec_sequence()
  powerpc/kexec: Reset secondary cpu endianness before kexec
  powerpc/hvsi: Fix endianness issues in the HVSI driver
  leds/powernv: Add driver for PowerNV platform
  powerpc/powernv: Create LED platform device
  powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states
  powerpc/powernv: Fix the log message when disabling VF
  cxl: Allow release of contexts which have been OPENED but not STARTED
  ...
2015-09-03 16:41:38 -07:00
Linus Torvalds
17e6b00ac4 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "This updated pull request does not contain the last few GIC related
  patches which were reported to cause a regression.  There is a fix
  available, but I let it breed for a couple of days first.

  The irq departement provides:

   - new infrastructure to support non PCI based MSI interrupts
   - a couple of new irq chip drivers
   - the usual pile of fixlets and updates to irq chip drivers
   - preparatory changes for removal of the irq argument from interrupt
     flow handlers
   - preparatory changes to remove IRQF_VALID"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources
  irqchip: Add bcm2836 interrupt controller for Raspberry Pi 2
  irqchip: Add documentation for the bcm2836 interrupt controller
  irqchip/bcm2835: Add support for being used as a second level controller
  irqchip/bcm2835: Refactor handle_IRQ() calls out of MAKE_HWIRQ
  PCI: xilinx: Fix typo in function name
  irqchip/gic: Ensure gic_cpu_if_up/down() programs correct GIC instance
  irqchip/gic: Only allow the primary GIC to set the CPU map
  PCI/MSI: pci-xgene-msi: Consolidate chained IRQ handler install/remove
  unicore32/irq: Prepare puv3_gpio_handler for irq argument removal
  tile/pci_gx: Prepare trio_handle_level_irq for irq argument removal
  m68k/irq: Prepare irq handlers for irq argument removal
  C6X/megamode-pic: Prepare megamod_irq_cascade for irq argument removal
  blackfin: Prepare irq handlers for irq argument removal
  arc/irq: Prepare idu_cascade_isr for irq argument removal
  sparc/irq: Use access helper irq_data_get_affinity_mask()
  sparc/irq: Use helper irq_data_get_irq_handler_data()
  parisc/irq: Use access helper irq_data_get_affinity_mask()
  mn10300/irq: Use access helper irq_data_get_affinity_mask()
  irqchip/i8259: Prepare i8259_irq_dispatch for irq argument removal
  ...
2015-09-01 14:33:35 -07:00
Alexey Kardashevskiy
0e1ffef02c powerpc/iommu: Set default DMA offset in dma_dev_setup
Commit e91c25111aa3 "powerpc/iommu: Cleanup setting of DMA base/offset"
expects that the default DMA offset is set from pnv_ioda_setup_bus_dma()
which is correct unless it is SRIOV where the code flow is different -
at the moment when pnv_ioda_setup_bus_dma() is called, PCI devices for
VFs are not created yet.

This adds missing set_dma_offset() to pnv_pci_ioda_dma_dev_setup() to
cover the case of SRIOV.

Note that we still need set_dma_offset() in pnv_ioda_setup_bus_dma() as
at the boot time pnv_pci_ioda_dma_dev_setup() is called when no PE was
created yet, this happens at the PHB fixup stage.

Fixes: e91c25111aa3 ("powerpc/iommu: Cleanup setting of DMA base/offset")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-27 19:35:18 +10:00
Samuel Mendoza-Jonas
e72bb8a5a8 powerpc/powernv: Reset HILE before kexec_sequence()
On powernv secondary cpus are returned to OPAL, and will then enter
the target kernel in big-endian. However if it is set the HILE bit
will persist, causing the first exception in the target kernel to be
delivered in litte-endian regardless of the current endianness.

If running on top of OPAL make sure the HILE bit is reset once we've
finished waiting for all of the secondaries to be returned to OPAL.

Signed-off-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20 18:19:09 +10:00
Vasant Hegde
c159b5968e powerpc/powernv: Create LED platform device
This patch adds platform devices for leds. Also export LED related
OPAL API's so that led driver can use these APIs.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20 18:19:07 +10:00
Anshuman Khandual
8a8d91817a powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states
This patch registers the following two new OPAL interfaces calls
for the platform LED subsystem. With the help of these new OPAL calls,
the kernel will be able to get or set the state of various individual
LEDs on the system at any given location code which is passed through
the LED specific device tree nodes.

	(1) OPAL_LEDS_GET_INDICATOR     opal_leds_get_ind
	(2) OPAL_LEDS_SET_INDICATOR     opal_leds_set_ind

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Tested-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20 18:19:07 +10:00
Wei Yang
74703cc4e0 powerpc/powernv: Fix the log message when disabling VF
On powernv platform, IOV BAR would be shifted if necessary. While the log
message is not correct when disabling VFs.

This patch fixes this by print correct message based on the offset value.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20 16:15:24 +10:00
Andrew Donnellan
53522982fc powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops
Simplify the dma_get_required_mask call chain by moving it from pnv_phb to
pci_controller_ops, similar to commit 763d2d8df1ee ("powerpc/powernv:
Move dma_set_mask from pnv_phb to pci_controller_ops").

Previous call chain:

  0) call dma_get_required_mask() (kernel/dma.c)
  1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that
     points to pnv_dma_get_required_mask() (platforms/powernv/setup.c)
  2) device is PCI, therefore call pnv_pci_dma_get_required_mask()
     (platforms/powernv/pci.c)
  3) call phb->dma_get_required_mask if it exists
  4) it only exists in the ioda case, where it points to
       pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c)

New call chain:

  0) call dma_get_required_mask() (kernel/dma.c)
  1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask
     if it exists
  2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask()
     (platforms/powernv/pci-ioda.c)

In the p5ioc2 case, the call chain remains the same -
dma_get_required_mask() does not find either a ppc_md call or
pci_controller_ops call, so it calls __dma_get_required_mask().

Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-18 19:32:11 +10:00
Gautham R. Shenoy
e63dbd16ab powerpc: Add an inline function to update POWER8 HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
prescribes that updates to HID0 be preceded by a SYNC instruction and
followed by an ISYNC instruction (Page 91).

Create an inline function name update_power8_hid0() which follows this
recipe and invoke it from the static split core path.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Tested-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-14 15:58:28 +10:00
Mahesh Salgaonkar
62521ea6db powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable HMI.
Invoke new opal_cec_reboot2() call with reboot type
OPAL_REBOOT_PLATFORM_ERROR (for unrecoverable HMI interrupts) to inform
BMC/OCC about this error, so that BMC can collect relevant data for error
analysis and decide what component to de-configure before rebooting.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06 15:10:19 +10:00
Mahesh Salgaonkar
e784b6499d powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine check errors.
On non-recoverable MCE errors in kernel space, Linux kernel panics
and system reboots. On BMC based system opal-prd runs as a daemon
in the host. Hence, kernel crash may prevent opal-prd to detect and
analyze this MCE error. This may land us in a situation where the faulty
memory never gets de-configured and Linux would keep hitting same MCE error
again and again. If this happens in early stage of kernel initialization,
then Linux will keep crashing and rebooting in a loop.

This patch fixes this issue by invoking new opal_cec_reboot2() call with
reboot type OPAL_REBOOT_PLATFORM_ERROR to inform BMC/OCC about this
error, so that BMC can collect relevant data for error analysis and
decide what component to de-configure before rebooting.

This patch is dependent on OPAL patchset posted on skiboot mailing list
at https://lists.ozlabs.org/pipermail/skiboot/2015-July/001771.html that
introduces opal_cec_reboot2() opal call.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06 15:10:18 +10:00
Mahesh Salgaonkar
1852ae276b powerpc/powernv: Pull all HMI events before panic.
In the event of unrecovered HMI the existing code panics as soon as
it receives the first unrecovered HMI event. This makes host to report
partial information about HMIs before panic. There may be more errors
which would have caused the HMI and hence more HMI event would have been
generated waiting to be pulled by host. This patch implements a logic to
pull and display all the HMI event before going down panic path.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06 15:10:18 +10:00
Mahesh Salgaonkar
c33e11d0dd powerpc/powernv: display reason for Malfunction Alert HMI.
The V2 version of HMI event now carries additional information for
Malfunction Alert. It now contains error information about CORE and NX
checkstop. This patch checks and displays the check stop reason before
panic.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06 15:09:59 +10:00
Alistair Popple
b8d65e9662 powerpc/eeh-powernv: Fix unbalanced IRQ warning
pnv_eeh_next_error() re-enables the eeh opal event interrupt but it
gets called from a loop if there are more outstanding events to
process, resulting in a warning due to enabling an already enabled
interrupt. Instead the interrupt should only be re-enabled once the
last outstanding event has been processed.

Tested-by: Daniel Axtens <dja@axtens.net>
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-30 19:01:32 +10:00
Marc Zyngier
ad3aedfbb0 genirq/irqdomain: Allow irq domain aliasing
It is not uncommon (at least with the ARM stuff) to have a piece
of hardware that implements different flavours of "interrupts".
A typical example of this is the GICv3 ITS, which implements
standard PCI/MSI support, but also some form of "generic MSI".

So far, the PCI/MSI domain is registered using the ITS device_node,
so that irq_find_host can return it. On the contrary, the raw MSI
domain is not registered with an device_node, making it impossible
to be looked up by another subsystem (obviously, using the same
device_node twice would only result in confusion, as it is not
defined which one irq_find_host would return).

A solution to this is to "type" domains that may be aliasing, and
to be able to lookup an device_node that matches a given type.
For this, we introduce irq_find_matching_host() as a superset
of irq_find_host:

struct irq_domain *irq_find_matching_host(struct device_node *node,
                                enum irq_domain_bus_token bus_token);

where bus_token is the "type" we want to match the domain against
(so far, only DOMAIN_BUS_ANY is defined). This result in some
moderately invasive changes on the PPC side (which is the only
user of the .match method).

This has otherwise no functionnal change.

Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Ma Jun <majun258@huawei.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Duc Dang <dhdang@apm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1438091186-10244-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-30 00:14:36 +02:00
Thomas Gleixner
4b979e4c61 Merge branch 'linus' into irq/core
Pull in upstream fixes before applying conflicting changes
2015-07-30 00:13:24 +02:00
Alexey Kardashevskiy
3ba3a73e9f powerpc/powernv/ioda2: Fix calculation for memory allocated for TCE table
The existing code stores the amount of memory allocated for a TCE table.
At the moment it uses @offset which is a virtual offset in the TCE table
which is only correct for a one level tables and it does not include
memory allocated for intermediate levels. When multilevel TCE table is
requested, WARN_ON in tce_iommu_create_table() prints a warning.

This adds an additional counter to pnv_pci_ioda2_table_do_alloc_pages()
to count actually allocated memory.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-23 19:56:44 +10:00
Paul Mackerras
01c9348c76 powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_*
The hardware RNG on POWER8 and POWER7+ can be relatively slow, since
it can only supply one 64-bit value per microsecond.  Currently we
read it in arch_get_random_long(), but that slows down reading from
/dev/urandom since the code in random.c calls arch_get_random_long()
for every longword read from /dev/urandom.

Since the hardware RNG supplies high-quality entropy on every read, it
matches the semantics of arch_get_random_seed_long() better than those
of arch_get_random_long().  Therefore this commit makes the code use
the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int}
and not for arch_get_random_{long,int}.

This won't affect any other PowerPC-based platforms because none of
them currently support a hardware RNG.  To make it clear that the
ppc_md function pointer is used for arch_get_random_seed_*, we rename
it from get_random_long to get_random_seed.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-23 19:52:03 +10:00
Jiang Liu
2921d1790e powerpc/PCI: Use for_pci_msi_entry() to access MSI device list
Use accessor for_each_pci_msi_entry() to access MSI device list, so we
could easily move msi_list from struct pci_dev into struct device
later.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Stuart Yoder <stuart.yoder@freescale.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Olof Johansson <olof@lixom.net>
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Tudor Laurentiu <b10716@freescale.com>
Cc: Hongtao Jia <hongtao.jia@freescale.com>
Link: http://lkml.kernel.org/r/1436428847-8886-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-22 18:37:42 +02:00
Gavin Shan
79cd952000 powerpc/eeh: Dump PHB diag-data for non-existing PE
When detecting EEH error on non-existing PE, including the reserved
one, the PE is simply unfrozen without dumping the PHB diag-data,
which is useful for locating the root cause of the EEH error. The
patch dumps the PHB diag-data when non-existing PE reports error.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-21 11:38:47 +10:00
Gavin Shan
0f36db7764 powerpc/eeh: Fix wrong printed PE number
On LE kernel, the non-existing PE number in BE format derived from
skiboot firmware isn't converted to LE format properly as following
kernel log indicates:

   EEH: Clear non-existing PHB#4-PE#200000000000000

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-21 11:38:46 +10:00
Vipin K Parashar
3b476aadbc powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
This patch adds support for OPAL EPOW (Environmental and Power Warnings)
and DPO (Delayed Power Off) events for the PowerNV platform. These events
are generated on FSP (Flexible Service Processor) based systems. EPOW
events are generated due to various critical system conditions that
require system shutdown. A few examples of these conditions are high
ambient temperature or system running on UPS power with low UPS battery.
DPO event is generated in response to admin initiated system shutdown
request. Upon receipt of EPOW and DPO events the host kernel invokes
orderly_poweroff() for performing graceful system shutdown.

Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Acked-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-16 13:34:36 +10:00
Gavin Shan
f951e51003 powerpc/powernv: Unfreeze VF PE on releasing it
When releasing PE for SRIOV VF, the PE is forced to be frozen
wrongly. When the same PE is picked for another VF, it won't
work anyhow. The patch fixes the issue by unfreezing, not
freezing the VF PE when releasing it.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:30 +10:00
Gavin Shan
283e2d8a59 powerpc/powernv: Include VF PE in PELTV of PF PE
The PELTV of PF PE should include VF PE, which is missed by current
code, so that the VF PE is frozen automatically when freezing PF PE.
The patch fixes the PELTV of PF PE to include VF PE.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:22 +10:00
Gavin Shan
26ba248d52 powerpc/powernv: Pick M64 PEs based on BARs
On PHB3, PE might be reserved in advance to reflect the M64 segments
consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI
devices included in the PE. The PE is picked based on M64 BARs instead
of the bridge's M64 windows, which might include VF BARs. Otherwise,
wrong PE could be picked.

The patch calculates the used M64 segments and PE numbers according to
the M64 BARs, excluding VF BARs, of PCI devices in one particular PE,
instead of the bridge's M64 windows. Then the right PE number is picked.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:01 +10:00
Gavin Shan
d1203852df powerpc/powernv: Boolean argument for pnv_ioda_setup_bus_PE()
The patch changes the type of last argument of pnv_ioda_setup_bus_PE()
and phb::pick_m64_pe() to boolean. No functional change.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:00 +10:00
Gavin Shan
96a2f92bf8 powerpc/powernv: Reserve M64 PEs based on BARs
On PHB3, some PEs might be reserved in advance to reflect the M64
segments consumed by those PEs. We're reserving PEs based on the
M64 window of root port, which might contain VF BAR. The PEs for
VFs are allocated dynamically, not reserved based on the consumed
M64 segments. So the M64 window of root port isn't reliable for
the task. Instead, we go through M64 BARs (VF BARs excluded) of
PCI devices under the specified root bus and reserve PEs accordingly,
as the patch does.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:00 +10:00
Gavin Shan
e9dc4d7f72 powerpc/powernv: Allow to reserve one PE for multiple times
The PE numbers are reserved according to root port's M64 window,
which is aligned to M64 segment finely. So one PE shouldn't be
reserved for multiple times. We will reserve PE numbers according
to the M64 BARs of PCI device in subsequent patches, which aren't
aligned to M64 segment size finely. It means one particular PE
could be reserved for multiple times.

The patch allows one PE to be reserved for multiple times and we
print the warning message at debugging level.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 16:12:00 +10:00
Benjamin Herrenschmidt
e91c25111a powerpc/iommu: Cleanup setting of DMA base/offset
Now that the table and the offset can co-exist, we no longer need
to flip/flop, we can just establish both once at boot time.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13 10:10:55 +10:00
Alistair Popple
a8956a7b72 powerpc/powernv: Fix opal-elog interrupt handler
The conversion of opal events to a proper irqchip means that handlers
are called until the relevant opal event has been cleared by
processing it. Events that queue work should therefore use a threaded
handler to mask the event until processing is complete.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-06 20:24:36 +10:00
Vaidyanathan Srinivasan
d8ea782b56 powerpc/powernv: Fix vma page prot flags in opal-prd driver
opal-prd driver will mmap() firmware code/data area as private
mapping to prd user space daemon.  Write to this page will
trigger COW faults.  The new COW pages are normal kernel RAM
pages accounted by the kernel and are not special.

vma->vm_page_prot value will be used at page fault time
for the new COW pages, while pgprot_t value passed in
remap_pfn_range() is used for the initial page table entry.

Hence:
* Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range()
* Also remap_pfn_range() will add the _PAGE_SPECIAL flag using
  pte_mkspecial() call, hence no need to specify in the driver

This fix resolves the page accounting warning shown below:
BUG: Bad rss-counter state mm:c0000007d34ac600 idx:1 val:19

The above warning is triggered since _PAGE_SPECIAL was incorrectly
being set for the normal kernel COW pages.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-06 12:06:42 +10:00
Alexey Kardashevskiy
5c89a87d13 powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
When pnv_pci_ioda_fixup() is called during PHB fixup time, each PE in
the sorted list of PEs (phb::pe_dma_list) is iterated to setup the PE's
DMA32 space by pnv_ioda_setup_bus_dma() if the PE's DMA32 weight is bigger
than zero. The function also assigns all the subordinate PCI devices of
the PE's primary bus with the PE's DMA32 IOMMU table. It causes the PCI
devicess in the child PEs, which don't have DMA weight, receives wrong
IOMMU table and then IOMMU group.

The patch fixes above issue by more check on the PE's coverage and don't
assign IOMMU table to those PCI devices, which belong to the child PEs.
The problem was found on Firestone platform initially.

Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-19 17:10:29 +10:00
Alexey Kardashevskiy
b5926430df powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
The pnv_pci_ioda2_unset_window() function is used to do the final
cleanup of a DMA window being released:
- via VFIO ioctl by the guest request;
- via unplugging a virtual PCI function.
However the function was under #ifdef CONFIG_IOMMU_API and was missing.

This moves the helper outside of IOMMU_API block and enables it
for either or both IOMMU_API and PCI_IOV.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18 07:16:02 +10:00
Jeremy Kerr
7185795a62 powerpc/powernv: fix construction of opal PRD messages
We currently have a bug in the PRD code, where the contents of an
incoming message (beyond the header) will be overwritten by the list
item manipulations when adding to to the prd_msg_queue.

This change reorders struct opal_prd_msg_queue_item, so that the
message body doesn't overlap the list_head.

We also clarify the memcpy of the message, as we're copying unnecessary
bytes at the end of the message data.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18 07:16:01 +10:00
Alistair Popple
02b6505c8f powerpc/powernv: Increase opal-irqchip initcall priority
The eeh subsystem for powernv requires the opal event irqchip to be
initialised prior to initialisation or the following errors are
produced (and eeh doesn't work as expected):

irq: XICS didn't like hwirq-0x9 to VIRQ17 mapping (rc=-22)
pnv_eeh_post_init: Can't request OPAL event interrupt (0)

On powernv eeh is initialised from a subsys_initcall due to a check
for machine_is(powernv) in eeh_init(). This patch increases the
initcall priority of opal_event_init() to an arch_initcall to ensure
the opal event interface is initialised prior to any users of it.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18 07:16:01 +10:00
Michael Ellerman
4bece972fc powerpc/powernv: pnv_init_idle_states() should only run on powernv
Although this init call checks for device tree properties before doing
anything, it should still only run on powernv machines.

Reviewed-by: Shreyas B Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-15 16:45:12 +10:00
Alexey Kardashevskiy
46d3e1e162 vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.

This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:54 +10:00
Alexey Kardashevskiy
0054719386 powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_ioda2_get_table_size()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:53 +10:00
Alexey Kardashevskiy
c035e37b58 powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release
The existing code programmed TVT#0 with some address and then
immediately released that memory.

This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:53 +10:00
Alexey Kardashevskiy
4793d65d1a vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:52 +10:00
Alexey Kardashevskiy
bbb845c4ba powerpc/powernv: Implement multilevel TCE tables
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM) so the kernel might be unable to
allocate contiguous chunk of physical memory to store the TCE table.

To address this, POWER8 CPU (actually, IODA2) supports multi-level
TCE tables, up to 5 levels which splits the table into a tree of
smaller subtables.

This adds multi-level TCE tables support to
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
helpers.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:51 +10:00
Alexey Kardashevskiy
43cb60ab7f powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
a first parameter (not pnv_ioda_pe) as it is going to be used as
a callback for VFIO DDW code.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:51 +10:00
Alexey Kardashevskiy
aca6913f55 powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per one VFIO container.

This moves the code which allocates the actual TCE tables to helpers:
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
These do not allocate/free the iommu_table struct.

This enforces window size to be a power of two.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:50 +10:00
Alexey Kardashevskiy
e5aad1e678 powerpc/powernv/ioda2: Rework iommu_table creation
This moves iommu_table creation to the beginning to make following changes
easier to review. This starts using table parameters from the iommu_table
struct.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:50 +10:00
Alexey Kardashevskiy
05c6cfb9dc powerpc/iommu/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy
c5bb44edee powerpc/powernv: Implement accessor to TCE entry
This replaces direct accesses to TCE table with a helper which
returns an TCE entry address. This does not make difference now but will
when multi-level TCE tables get introduces.

No change in behavior is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy
e57080f17d powerpc/powernv/ioda2: Add TCE invalidation for all attached groups
The iommu_table struct keeps a list of IOMMU groups it is used for.
At the moment there is just a single group attached but further
patches will add TCE table sharing. When sharing is enabled, TCE cache
in each PE needs to be invalidated so does the patch.

This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
to enable TCE table sharing on PHBs older than IODA2.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:48 +10:00
Alexey Kardashevskiy
5780fb0426 powerpc/powernv/ioda2: Move TCE kill register address to PE
At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
property which contains the TCE kill register address. Writing to
this register invalidates TCE cache on IODA/IODA2 hub.

This moves the register address from iommu_table to pnv_pnb as this
register belongs to PHB and invalidates TCE cache for all tables of
all attached PEs.

This moves the property reading/remapping code to a helper which is
called when DMA is being configured for PE and which does DMA setup
for both IODA1 and IODA2.

This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
invalidates cache for the entire table. It should be called after
every call to opal_pci_map_pe_dma_window(). It was not required before
because there was just a single TCE table and 64bit DMA was handled via
bypass window (which has no table so no cache was used) but this is going
to change with Dynamic DMA windows (DDW).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:48 +10:00
Alexey Kardashevskiy
f87a88642e vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.

At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.

The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.

As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:47 +10:00