linux

iv/linux

Author	SHA1	Message	Date
Michael Ellerman	9b7e4d601b	Merge branch 'fixes' into next Merge our fixes branch. It has a few important fixes that are needed for futher testing and also some commits that will conflict with content in next.	2018-10-09 16:51:05 +11:00
Mark Hairgrove	f86ad3e019	powerpc/powernv/npu: Remove atsd_threshold debugfs setting This threshold is no longer used now that all invalidates issue a single ATSD to each active NPU. Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-04 16:55:54 +10:00
Mark Hairgrove	3689c37d23	powerpc/powernv/npu: Use size-based ATSD invalidates Prior to this change only two types of ATSDs were issued to the NPU: invalidates targeting a single page and invalidates targeting the whole address space. The crossover point happened at the configurable atsd_threshold which defaulted to 2M. Invalidates that size or smaller would issue per-page invalidates for the whole range. The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all. These invalidates target addresses aligned to their size. 2M is a common invalidation size for GPU-enabled applications because that is a GPU page size, so reducing the number of invalidates by 32x in that case is a clear improvement. ATSD latency is high in general so now we always issue a single invalidate rather than multiple. This will over-invalidate in some cases, but for any invalidation size over 2M it matches or improves the prior behavior. There's also an improvement for single-page invalidates since the prior version issued two invalidates for that case instead of one. With this change all issued ATSDs now perform a flush, so the flush parameter has been removed from all the helpers. To show the benefit here are some performance numbers from a microbenchmark which creates a 1G allocation then uses mprotect with PROT_NONE to trigger invalidates in strides across the allocation. One NPU (1 GPU): mprotect rate (GB/s) Stride Before After Speedup 64K 5.3 5.6 5% 1M 39.3 57.4 46% 2M 49.7 82.6 66% 4M 286.6 285.7 0% Two NPUs (6 GPUs): mprotect rate (GB/s) Stride Before After Speedup 64K 6.5 7.4 13% 1M 33.4 67.9 103% 2M 38.7 93.1 141% 4M 356.7 354.6 -1% Anything over 2M is roughly the same as before since both cases issue a single ATSD. Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Reviewed-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-04 16:55:53 +10:00
Mark Hairgrove	7ead15a144	powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates There are two types of ATSDs issued to the NPU: invalidates targeting a specific virtual address and invalidates targeting the whole address space. In both cases prior to this change, the sequence was: for each NPU - Write the target address to the XTS_ATSD_AVA register - EIEIO - Write the launch value to issue the ATSD First, a target address is not required when invalidating the whole address space, so that write and the EIEIO have been removed. The AP (size) field in the launch is not needed either. Second, for per-address invalidates the above sequence is inefficient in the common case of multiple NPUs because an EIEIO is issued per NPU. This unnecessarily forces the launches of later ATSDs to be ordered with the launches of earlier ones. The new sequence only issues a single EIEIO: for each NPU - Write the target address to the XTS_ATSD_AVA register EIEIO for each NPU - Write the launch value to issue the ATSD Performance results were gathered using a microbenchmark which creates a 1G allocation then uses mprotect with PROT_NONE to trigger invalidates in strides across the allocation. With only a single NPU active (one GPU) the difference is in the noise for both types of invalidates (+/-1%). With two NPUs active (on a 6-GPU system) the effect is more noticeable: mprotect rate (GB/s) Stride Before After Speedup 64K 5.9 6.5 10% 1M 31.2 33.4 7% 2M 36.3 38.7 7% 4M 322.6 356.7 11% Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-04 16:55:52 +10:00
Michal Suchanek	8a03e81cb1	powerpc/64s: consolidate MCE counter increment. The code in machine_check_exception excludes 64s hvmode when incrementing the MCE counter only to call opal_machine_check to increment it specifically for this case. Remove the exclusion and special case. Fixes: `a43c159042` ("powerpc/pseries: Flush SLB contents on SLB MCE errors.") Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-03 15:40:06 +10:00
Breno Leitao	62dea077f5	powerpc/powernv: Mark function as __noreturn There is a mismatch between function pnv_platform_error_reboot() definition and declaration regarding function modifiers. In the declaration part, it contains the function attribute __noreturn, while function definition itself lacks it. This was reported by sparse tool as an error: arch/powerpc/platforms/powernv/opal.c:538:6: error: symbol 'pnv_platform_error_reboot' redeclared with different type (originally declared at arch/powerpc/platforms/powernv/powernv.h:11) - different modifiers I checked and the function is already being considered as being 'noreturn' by the compiler, thus, I understand this patch does not change any code being generated. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-03 15:40:05 +10:00
Rob Herring	b9ef7b4b86	powerpc: Convert to using %pOFn instead of device_node.name In preparation to remove the node name pointer from struct device_node, convert printf users to use the %pOFn format specifier. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-03 15:40:01 +10:00
Vaibhav Jain	8139046a5a	powerpc/powernv: Make possible for user to force a full ipl cec reboot Ever since fast reboot is enabled by default in opal, opal_cec_reboot() will use fast-reset instead of full IPL to perform system reboot. This leaves the user with no direct way to force a full IPL reboot except changing an nvram setting that persistently disables fast-reset for all subsequent reboots. This patch provides a more direct way for the user to force a one-shot full IPL reboot by passing the command line argument 'full' to the reboot command. So the user will be able to tweak the reboot behavior via: $ sudo reboot full # Force a full ipl reboot skipping fast-reset or $ sudo reboot # default reboot path (usually fast-reset) The reboot command passes the un-parsed command argument to the kernel via the 'Reboot' syscall which is then passed on to the arch function pnv_restart(). The patch updates pnv_restart() to handle this cmd-arg and issues opal_cec_reboot2 with OPAL_REBOOT_FULL_IPL to force a full IPL reset. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-10-03 15:39:45 +10:00
Alexey Kardashevskiy	7233b8cab3	powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again) mpe: This was fixed originally in commit `d3d4ffaae4` ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but contrary to what the merge commit says was inadvertently lost by me in commit `ce57c6610c` ("Merge branch 'topic/ppc-kvm' into next") which brought in changes that moved the code to a new file. So reapply it to the new file. Original commit message follows: We use PHB in mode1 which uses bit 59 to select a correct DMA window. However there is mode2 which uses bits 59:55 and allows up to 32 DMA windows per a PE. Even though documentation does not clearly specify that, it seems that the actual hardware does not support bits 59:55 even in mode1, in other words we can create a window as big as 1<<58 but DMA simply won't work. This reduces the upper limit from 59 to 55 bits to let the userspace know about the hardware limits. Fixes: `ce57c6610c` ("Merge branch 'topic/ppc-kvm' into next") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-09-20 14:31:03 +10:00
Mahesh Salgaonkar	a43c159042	powerpc/pseries: Flush SLB contents on SLB MCE errors. On pseries, as of today system crashes if we get a machine check exceptions due to SLB errors. These are soft errors and can be fixed by flushing the SLBs so the kernel can continue to function instead of system crash. We do this in real mode before turning on MMU. Otherwise we would run into nested machine checks. This patch now fetches the rtas error log in real mode and flushes the SLBs on SLB/ERAT errors. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michal Suchanek <msuchanek@suse.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-09-19 21:59:22 +10:00
Rashmica Gupta	3f7daf3d75	powerpc/memtrace: Remove memory in chunks When hot-removing memory release_mem_region_adjustable() splits iomem resources if they are not the exact size of the memory being hot-deleted. Adding this memory back to the kernel adds a new resource. Eg a node has memory 0x0 - 0xfffffffff. Hot-removing 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff. When we hot-add the memory back we now have three resources: 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff. This is an issue if we try to remove some memory that overlaps resources. Eg when trying to remove 2GB at address 0xf40000000, release_mem_region_adjustable() fails as it expects the chunk of memory to be within the boundaries of a single resource. We then get the warning: "Unable to release resource" and attempting to use memtrace again gives us this error: "bash: echo: write error: Resource temporarily unavailable" This patch makes memtrace remove memory in chunks that are always the same size from an address that is always equal to end_of_memory - n*size, for some n. So hotremoving and hotadding memory of different sizes will now not attempt to remove memory that spans multiple resources. Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-09-19 21:58:02 +10:00
Joel Stanley	b0dc0f8618	powerpc/powernv: Don't select the cpufreq governors Deciding wich govenors should be built into the kernel can be left to users to configure. Fixes: `81f359027a` ("cpufreq: powernv: Select CPUFreq related Kconfig options for powernv") Signed-off-by: Joel Stanley <joel@jms.id.au> [mpe: Update powernv/ppc64 defconfigs to enable them by default] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-09-17 21:16:26 +10:00
Linus Torvalds	aba16dc5cf	Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax Pull IDA updates from Matthew Wilcox: "A better IDA API: id = ida_alloc(ida, GFP_xxx); ida_free(ida, id); rather than the cumbersome ida_simple_get(), ida_simple_remove(). The new IDA API is similar to ida_simple_get() but better named. The internal restructuring of the IDA code removes the bitmap preallocation nonsense. I hope the net -200 lines of code is convincing" * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits) ida: Change ida_get_new_above to return the id ida: Remove old API test_ida: check_ida_destroy and check_ida_alloc test_ida: Convert check_ida_conv to new API test_ida: Move ida_check_max test_ida: Move ida_check_leaf idr-test: Convert ida_check_nomem to new API ida: Start new test_ida module target/iscsi: Allocate session IDs from an IDA iscsi target: fix session creation failure handling drm/vmwgfx: Convert to new IDA API dmaengine: Convert to new IDA API ppc: Convert vas ID allocation to new IDA API media: Convert entity ID allocation to new IDA API ppc: Convert mmu context allocation to new IDA API Convert net_namespace to new IDA API cb710: Convert to new IDA API rsxx: Convert to new IDA API osd: Convert to new IDA API sd: Convert to new IDA API ...	2018-08-26 11:48:42 -07:00
Linus Torvalds	aa5b1054ba	Merge tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - An implementation for the newly added hv_ops->flush() for the OPAL hvc console driver backends, I forgot to apply this after merging the hvc driver changes before the merge window. - Enable all PCI bridges at boot on powernv, to avoid races when multiple children of a bridge try to enable it simultaneously. This is a workaround until the PCI core can be enhanced to fix the races. - A fix to query PowerVM for the correct system topology at boot before initialising sched domains, seen in some configurations to cause broken scheduling etc. - A fix for pte_access_permitted() on "nohash" platforms. - Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9 due to a workaround when using the nest MMU (GPUs, accelerators). - Another fix to the VFIO code used by KVM, the previous fix had some bugs which caused guests to not start in some configurations. - A handful of other minor fixes. Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy, Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju. * tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mce: Fix SLB rebolting during MCE recovery path. KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid. powerpc/nohash: fix pte_access_permitted() powerpc/topology: Get topology for shared processors at boot powerpc64/ftrace: Include ftrace.h needed for enable/disable calls powerpc/powernv/pci: Work around races in PCI bridge enabling powerpc/fadump: cleanup crash memory ranges support powerpc/powernv: provide a console flush operation for opal hvc driver powerpc/traps: Avoid rate limit messages from show unhandled signals powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()	2018-08-24 09:34:23 -07:00
Matthew Wilcox	cd38049e48	ppc: Convert vas ID allocation to new IDA API Removes a custom spinlock and simplifies the code. Also fix an error where we could allocate one ID too many. Signed-off-by: Matthew Wilcox <willy@infradead.org>	2018-08-21 23:54:19 -04:00
Benjamin Herrenschmidt	db2173198b	powerpc/powernv/pci: Work around races in PCI bridge enabling The generic code is racy when multiple children of a PCI bridge try to enable it simultaneously. This leads to drivers trying to access a device through a not-yet-enabled bridge, and this EEH errors under various circumstances when using parallel driver probing. There is work going on to fix that properly in the PCI core but it will take some time. x86 gets away with it because (outside of hotplug), the BIOS enables all the bridges at boot time. This patch does the same thing on powernv by enabling all bridges that have child devices at boot time, thus avoiding subsequent races. It's suitable for backporting to stable and distros, while the proper PCI fix will probably be significantly more invasive. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-20 20:29:17 +10:00
Nicholas Piggin	95b861a76c	powerpc/powernv: provide a console flush operation for opal hvc driver Provide the flush hv_op for the opal hvc driver. This will flush the firmware console buffers without spinning with interrupts disabled. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-20 20:19:54 +10:00
Linus Torvalds	5e2d059b52	Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - A fix for a bug in our page table fragment allocator, where a page table page could be freed and reallocated for something else while still in use, leading to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for a powerpc only refcount. - Fixes to our pkey support. Several are user-visible changes, but bring us in to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer for reporting many of these. - A series to improve the hvc driver & related OPAL console code, which have been seen to cause hardlockups at times. The hvc driver changes in particular have been in linux-next for ~month. - Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y. - Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use anywhere other than as a paper weight. - An optimised memcmp implementation using Power7-or-later VMX instructions - Support for barrier_nospec on some NXP CPUs. - Support for flushing the count cache on context switch on some IBM CPUs (controlled by firmware), as a Spectre v2 mitigation. - A series to enhance the information we print on unhandled signals to bring it into line with other arches, including showing the offending VMA and dumping the instructions around the fault. Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand, Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao, zhong jiang" * tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits) powerpc/mm/book3s/radix: Add mapping statistics powerpc/uaccess: Enable get_user(u64, *p) on 32-bit powerpc/mm/hash: Remove unnecessary do { } while(0) loop powerpc/64s: move machine check SLB flushing to mm/slb.c powerpc/powernv/idle: Fix build error powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range powerpc/mm: remove warning about ‘type’ being set powerpc/32: Include setup.h header file to fix warnings powerpc: Move `path` variable inside DEBUG_PROM powerpc/powermac: Make some functions static powerpc/powermac: Remove variable x that's never read cxl: remove a dead branch powerpc/powermac: Add missing include of header pmac.h powerpc/kexec: Use common error handling code in setup_new_fdt() powerpc/xmon: Add address lookup for percpu symbols powerpc/mm: remove huge_pte_offset_and_shift() prototype powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled powerpc/pseries: Fix endianness while restoring of r3 in MCE handler. powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements powerpc/fadump: handle crash memory ranges array index overflow ...	2018-08-17 11:32:50 -07:00
Aneesh Kumar K.V	ae24ce5e12	powerpc/powernv/idle: Fix build error Fix the below build error using strlcpy instead of strncpy In function 'pnv_parse_cpuidle_dt', inlined from 'pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:840:7, inlined from '__machine_initcall_powernv_pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:870:1: arch/powerpc/platforms/powernv/idle.c:820:3: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation] strncpy(pnv_idle_states[i].name, temp_string[i], ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PNV_IDLE_NAME_LEN); Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-10 22:12:39 +10:00
Rashmica Gupta	d3da701d33	powerpc/powernv: Allow memory that has been hot-removed to be hot-added This patch allows the memory removed by memtrace to be readded to the kernel. So now you don't have to reboot your system to add the memory back to the kernel or to have a different amount of memory removed. Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-10 22:12:31 +10:00
Benjamin Herrenschmidt	77b5f703dc	powerpc/powernv/opal: Use standard interrupts property when available For (bad) historical reasons, OPAL used to create a non-standard pair of properties "opal-interrupts" and "opal-interrupts-names" for representing the list of interrupts it wants Linux to request on its behalf. Among other issues, the opal-interrupts doesn't have a way to carry the type of interrupts, and they were assumed to be all level sensitive. This is wrong on some recent systems where some of them are edge sensitive causing warnings in the XIVE code and possible misbehaviours if they need to be retriggered (typically the NPU2 TCE error interrupts). This makes Linux switch to using the standard "interrupts" and "interrupt-names" properties instead when they are available, using standard of_irq helpers, which can carry all the desired type information. Newer versions of OPAL will generate those properties in addition to the legacy ones. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Fixup prefix logic to check strlen(r->name). Reinstate setting of start = 0 in opal_event_shutdown() to avoid double free warnings] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-08 00:32:38 +10:00
Haren Myneni	656ecc16e8	crypto/nx: Initialize 842 high and normal RxFIFO control registers NX increments readOffset by FIFO size in receive FIFO control register when CRB is read. But the index in RxFIFO has to match with the corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX may be processing incorrect CRBs and can cause CRB timeout. VAS FIFO offset is 0 when the receive window is opened during initialization. When the module is reloaded or in kexec boot, readOffset in FIFO control register may not match with VAS entry. This patch adds nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO control register for both high and normal FIFOs. Signed-off-by: Haren Myneni <haren@us.ibm.com> [mpe: Fixup uninitialized variable warning] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-08 00:32:34 +10:00
Haren Myneni	6e708000ec	powerpc/powernv: Export opal_check_token symbol Export opal_check_token symbol for modules to check the availability of OPAL calls before using them. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-08 00:32:33 +10:00
Michael Ellerman	99d54754d3	powerpc/powernv: Query firmware for count cache flush settings Look for fw-features properties to determine the appropriate settings for the count cache flush, and then call the generic powerpc code to set it up based on the security feature flags. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-08 00:32:27 +10:00
Michael Ellerman	af375eefbf	powerpc/64: Call setup_barrier_nospec() from setup_arch() Currently we require platform code to call setup_barrier_nospec(). But if we add an empty definition for the !CONFIG_PPC_BARRIER_NOSPEC case then we can call it in setup_arch(). Signed-off-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-08 00:32:23 +10:00
Benjamin Herrenschmidt	e27e0a9465	powerpc/xive: Remove xive_kexec_teardown_cpu() It's identical to xive_teardown_cpu() so just use the latter Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-07 21:49:28 +10:00
Reza Arbab	9eab9901b0	powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage We've encountered a performance issue when multiple processors stress {get,put}_mmio_atsd_reg(). These functions contend for mmio_atsd_usage, an unsigned long used as a bitmask. The accesses to mmio_atsd_usage are done using test_and_set_bit_lock() and clear_bit_unlock(). As implemented, both of these will require a (successful) stwcx to that same cache line. What we end up with is thread A, attempting to unlock, being slowed by other threads repeatedly attempting to lock. A's stwcx instructions fail and retry because the memory reservation is lost every time a different thread beats it to the punch. There may be a long-term way to fix this at a larger scale, but for now resolve the immediate problem by gating our call to test_and_set_bit_lock() with one to test_bit(), which is obviously implemented without using a store. Fixes: `1ab66d1fba` ("powerpc/powernv: Introduce address translation services for Nvlink2") Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Acked-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-03 20:09:01 +10:00
Nicholas Piggin	3127692deb	powernv/cpuidle: Fix idle states all being marked invalid Commit `9c7b185ab2` ("powernv/cpuidle: Parse dt idle properties into global structure") parses dt idle states into structs, but never marks them valid. This results in all idle states being lost. Fixes: `9c7b185ab2` ("powernv/cpuidle: Parse dt idle properties into global structure") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-08-03 18:30:33 +10:00
Hari Vyas	44bda4b7d2	PCI: Fix is_added/is_busmaster race condition When a PCI device is detected, pdev->is_added is set to 1 and proc and sysfs entries are created. When the device is removed, pdev->is_added is checked for one and then device is detached with clearing of proc and sys entries and at end, pdev->is_added is set to 0. is_added and is_busmaster are bit fields in pci_dev structure sharing same memory location. A strange issue was observed with multiple removal and rescan of a PCIe NVMe device using sysfs commands where is_added flag was observed as zero instead of one while removing device and proc,sys entries are not cleared. This causes issue in later device addition with warning message "proc_dir_entry" already registered. Debugging revealed a race condition between the PCI core setting the is_added bit in pci_bus_add_device() and the NVMe driver reset work-queue setting the is_busmaster bit in pci_set_master(). As these fields are not handled atomically, that clears the is_added bit. Move the is_added bit to a separate private flag variable and use atomic functions to set and retrieve the device addition state. This avoids the race because is_added no longer shares a memory location with is_busmaster. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283 Signed-off-by: Hari Vyas <hari.vyas@broadcom.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-31 11:27:54 -05:00
Shilpasri G Bhat	04baaf28f4	powerpc/powernv: Add support to enable sensor groups Adds support to enable/disable a sensor group at runtime. This can be used to select the sensor groups that needs to be copied to main memory by OCC. Sensor groups like power, temperature, current, voltage, frequency, utilization can be enabled/disabled at runtime. Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-31 19:56:45 +10:00
Akshay Adiga	9c7b185ab2	powernv/cpuidle: Parse dt idle properties into global structure Device-tree parsing happens twice, once while deciding idle state to be used for hotplug and once during cpuidle init. Hence, parsing the device tree and caching it will reduce code duplication. Parsing code has been moved to pnv_parse_cpuidle_dt() from pnv_probe_idle_states(). In addition to the properties in the device tree the number of available states is also required. Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-31 19:56:44 +10:00
Christophe Leroy	2c86cd188f	powerpc: clean inclusions of asm/feature-fixups.h files not using feature fixup don't need asm/feature-fixups.h files using feature fixup need asm/feature-fixups.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-30 22:48:17 +10:00
Christophe Leroy	5c35a02c54	powerpc: clean the inclusion of stringify.h Only include linux/stringify.h is files using __stringify() Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-30 22:48:17 +10:00
Christophe Leroy	ec0c464cdb	powerpc: move ASM_CONST and stringify_in_c() into asm-const.h This patch moves ASM_CONST() and stringify_in_c() into dedicated asm-const.h, then cleans all related inclusions. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: asm-compat.h should include asm-const.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-30 22:48:16 +10:00
Nicholas Piggin	17cc1dd492	powerpc/powernv: implement opal_put_chars_atomic The RAW console does not need writes to be atomic, so relax opal_put_chars to be able to do partial writes, and implement an _atomic variant which does not take a spinlock. This API is used in xmon, so the less locking that is used, the better chance there is that a crash can be debugged. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:57 +10:00
Nicholas Piggin	ac4ac788fd	powerpc/powernv: move opal console flushing to udbg OPAL console writes do not have to synchronously flush firmware / hardware buffers unless they are going through the udbg path. Remove the unconditional flushing from opal_put_chars. Flush if there was no space in the buffer as an optimisation (callers loop waiting for success in that case). udbg flushing is moved to udbg_opal_putc. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:57 +10:00
Nicholas Piggin	b74d2807ae	powerpc/powernv: Remove OPALv1 support from opal console driver opal_put_chars deals with partial writes because in OPALv1, opal_console_write_buffer_space did not work correctly. That firmware is not supported. This reworks the opal_put_chars code to no longer deal with partial writes by turning them into full writes. Partial write handling is still supported in terms of what gets returned to the caller, but it may not go to the console atomically. A warning message is printed in this case. This allows console flushing to be moved out of the opal_write_lock spinlock. That could cause the lock to be held for long periods if the console is busy (especially if it was being spammed by firmware), which is dangerous because the lock is taken by xmon to debug the system. Flushing outside the lock improves the situation a bit. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:56 +10:00
Nicholas Piggin	d2a2262e68	powerpc/powernv: Implement and use opal_flush_console A new console flushing firmware API was introduced to replace event polling loops, and implemented in opal-kmsg with `affddff69c` ("powerpc/powernv: Add a kmsg_dumper that flushes console output on panic"), to flush the console in the panic path. The OPAL console driver has other situations where interrupts are off and it needs to flush the console synchronously. These still use a polling loop. So move the opal-kmsg flush code to opal_flush_console, and use the new function in opal-kmsg and opal_put_chars. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:56 +10:00
Nicholas Piggin	e00da0f2db	powerpc/powernv: opal-kmsg use flush fallback from console code Use the more refined and tested event polling loop from opal_put_chars as the fallback console flush in the opal-kmsg path. This loop is used by the console driver today, whereas the opal-kmsg fallback is not likely to have been used for years. Use WARN_ONCE rather than a printk when the fallback is invoked to prepare for moving the console flush into a common function. Reviewed-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:56 +10:00
Nicholas Piggin	3a80bfc7ea	powerpc/powernv: opal-kmsg standardise OPAL_BUSY handling OPAL_CONSOLE_FLUSH is documented as being able to return OPAL_BUSY, so implement the standard OPAL_BUSY handling for it. Reviewed-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:55 +10:00
Nicholas Piggin	36d2dabc87	powerpc/powernv: Fix OPAL console driver OPAL_BUSY loops The OPAL console driver does not delay in case it gets OPAL_BUSY or OPAL_BUSY_EVENT from firmware. It can't yet be made to sleep because it is called under spinlock, but it can be changed to the standard OPAL_BUSY loop form, and a delay added to keep it from hitting the firmware too frequently. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:55 +10:00
Nicholas Piggin	bd90284cc6	powerpc/powernv: opal_put_chars partial write fix The intention here is to consume and discard the remaining buffer upon error. This works if there has not been a previous partial write. If there has been, then total_len is no longer total number of bytes to copy. total_len is always "bytes left to copy", so it should be added to written bytes. This code may not be exercised any more if partial writes will not be hit, but this is a small bugfix before a larger change. Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:09:54 +10:00
Mukesh Ojha	b29336c0e1	powerpc/powernv/opal-dump : Use IRQ_HANDLED instead of numbers in interrupt handler Fixes: `8034f715f` ("powernv/opal-dump: Convert to irq domain") Converts all the return explicit number to a more proper IRQ_HANDLED, which looks proper incase of interrupt handler returning case. Here, It also removes error message like "nobody cared" which was getting unveiled while returning -1 or 0 from handler. Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:03:24 +10:00
Mukesh Ojha	a5bbe8fd29	powerpc/powernv/opal-dump : Handles opal_dump_info properly Moves the return value check of 'opal_dump_info' to a proper place which was previously unnecessarily filling all the dump info even on failure. Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-24 22:03:23 +10:00
Alistair Popple	99c3ce33a0	powerpc/powernv/npu: Add a debugfs setting to change ATSD threshold The threshold at which it becomes more efficient to coalesce a range of ATSDs into a single per-PID ATSD is currently not well understood due to a lack of real-world work loads. This patch adds a debugfs parameter allowing the threshold to be altered at runtime in order to aid future development and refinement of the value. Signed-off-by: Alistair Popple <alistair@popple.id.au> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-19 21:58:10 +10:00
Michael Ellerman	ce57c6610c	Merge branch 'topic/ppc-kvm' into next Merge in some commits we're sharing with the KVM tree. I manually propagated the change from commit `d3d4ffaae4` ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size") into pci-ioda-tce.c. Conflicts: arch/powerpc/include/asm/cputable.h arch/powerpc/platforms/powernv/pci-ioda.c arch/powerpc/platforms/powernv/pci.h	2018-07-19 14:37:57 +10:00
Alexey Kardashevskiy	a68bd1267b	powerpc/powernv/ioda: Allocate indirect TCE levels on demand At the moment we allocate the entire TCE table, twice (hardware part and userspace translation cache). This normally works as we normally have contigous memory and the guest will map entire RAM for 64bit DMA. However if we have sparse RAM (one example is a memory device), then we will allocate TCEs which will never be used as the guest only maps actual memory for DMA. If it is a single level TCE table, there is nothing we can really do but if it a multilevel table, we can skip allocating TCEs we know we won't need. This adds ability to allocate only first level, saving memory. This changes iommu_table::free() to avoid allocating of an extra level; iommu_table::set() will do this when needed. This adds @alloc parameter to iommu_table::exchange() to tell the callback if it can allocate an extra level; the flag is set to "false" for the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns H_TOO_HARD. This still requires the entire table to be counted in mm::locked_vm. To be conservative, this only does on-demand allocation when the usespace cache table is requested which is the case of VFIO. The example math for a system replicating a powernv setup with NVLink2 in a guest: 16GB RAM mapped at 0x0 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000 the table to cover that all with 64K pages takes: (((0x244000000000 + 0x2000000000) >> 16)8)>>20 = 4556MB If we allocate only necessary TCE levels, we will only need: (((0x400000000 + 0x400000000) >> 16)8)>>20 = 4MB (plus some for indirect levels). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-16 22:53:11 +10:00
Alexey Kardashevskiy	9bc98c8a43	powerpc/powernv: Rework TCE level allocation This moves actual pages allocation to a separate function which is going to be reused later in on-demand TCE allocation. While we are at it, remove unnecessary level size round up as the caller does this already. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-16 22:53:10 +10:00
Alexey Kardashevskiy	090bad39b2	powerpc/powernv: Add indirect levels to it_userspace We want to support sparse memory and therefore huge chunks of DMA windows do not need to be mapped. If a DMA window big enough to require 2 or more indirect levels, and a DMA window is used to map all RAM (which is a default case for 64bit window), we can actually save some memory by not allocation TCE for regions which we are not going to map anyway. The hardware tables alreary support indirect levels but we also keep host-physical-to-userspace translation array which is allocated by vmalloc() and is a flat array which might use quite some memory. This converts it_userspace from vmalloc'ed array to a multi level table. As the format becomes platform dependend, this replaces the direct access to it_usespace with a iommu_table_ops::useraddrptr hook which returns a pointer to the userspace copy of a TCE; future extension will return NULL if the level was not allocated. This should not change non-KVM handling of TCE tables and it_userspace will not be allocated for non-KVM tables. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-16 22:53:10 +10:00
Alexey Kardashevskiy	191c22879f	powerpc/powernv: Move TCE manupulation code to its own file Right now we have allocation code in pci-ioda.c and traversing code in pci.c, let's keep them toghether. However both files are big enough already so let's move this business to a new file. While we at it, move the code which links IOMMU table groups to IOMMU tables as it is not specific to any PNV PHB model. These puts exported symbols from the new file together. This fixes several warnings from checkpatch.pl like this: "WARNING: Prefer 'unsigned int' to bare use of 'unsigned'". As this is almost cut-n-paste, there should be no behavioral change. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-07-16 22:53:07 +10:00

1 2 3 4 5 ...

1109 Commits