132280 Commits

Author SHA1 Message Date
Nicholas Piggin
75bda95048 powerpc: Don't print cpu_spec->cpu_name if it's NULL
Currently we assume that if the cpu_spec has a pvr_mask then it must also have a
cpu_name. But that will change in a subsequent commit when we do CPU feature
discovery via the device tree, so check explicitly if cpu_name is NULL.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-09 22:56:03 +10:00
Michael Ellerman
0b382fb3d9 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next
Freescale updates from Scott:

"Includes a fix for a powerpc/next mm regression on 64e, a fix for a
kernel hang on 64e when using a debugger inside a relocated kernel, a
qman fix, and misc qe improvements."
2017-05-09 22:54:35 +10:00
Nicholas Piggin
6102c005a7 powerpc/64s: Fix unnecessary machine check handler relocation branch
Similarly to commit 2563a70c3b ("powerpc/64s: Remove unnecessary relocation
branch from idle handler"), the machine check handler has a BRANCH_TO from
relocated to relocated code, which is unnecessary.

It has also caused build errors with some toolchains:

  arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
  arch/powerpc/kernel/exceptions-64s.S:395: Error: operand out of range
  (0xffffffffffff8280 is not between 0x0000000000000000 and
  0x000000000000ffff)

Fixes: 1945bc4549e5 ("powerpc/64s: Fix POWER9 machine check handler from stop state")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-and-tested-by : Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-09 19:25:50 +10:00
Michael Ellerman
ba95b5d035 powerpc/mm/book3s/64: Rework page table geometry for lower memory usage
Recently in commit f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
we increased the virtual address space for user processes to 128TB by default,
and up to 512TB if user space opts in.

This obviously required expanding the range of the Linux page tables. For Book3s
64-bit using hash and with PAGE_SIZE=64K, we increased the PGD to 2^15 entries.
This meant we could cover the full address range, while still being able to
insert a 16G hugepage at the PGD level and a 16M hugepage in the PMD.

The downside of that geometry is that it uses a lot of memory for the PGD, and
in particular makes the PGD a 4-page allocation, which means it's much more
likely to fail under memory pressure.

Instead we can make the PMD larger, so that a single PUD entry maps 16G,
allowing the 16G hugepages to sit at that level in the tree. We're then able to
split the remaining bits between the PUG and PGD. We make the PGD slightly
larger as that results in lower memory usage for typical programs.

When THP is enabled the PMD actually doubles in size, to 2^11 entries, or 2^14
bytes, which is large but still < PAGE_SIZE.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-05-09 19:24:23 +10:00
Horia Geantă
24e0bfbf63 powerpc: Fix distclean with Makefile.postlink
Makefile.postlink always includes include/config/auto.conf, however
this file is not present in a clean kernel tree, causing make to fail:

  $ git clone linuxppc.git
  $ cd linuxppc.git
  $ make distclean
  arch/powerpc/Makefile.postlink:10: include/config/auto.conf: No such file or directory
  make[1]: *** No rule to make target `include/config/auto.conf'.  Stop.
  make: *** [vmlinuxclean] Error 2

Equally running 'make distclean; make distclean' will trip the error case.

Change the inclusion such that file not being found does not trigger an error.

Fixes: f188d0524d7e ("powerpc: Use the new post-link pass to check relocations")
Reported-by: Mircea Pop <mircea.pop@nxp.com>
Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
Tested-by: Justin M. Forbes <jforbes@fedoraproject.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-09 19:24:23 +10:00
Scott Wood
61baf15555 powerpc/64e: Don't place the stack beyond TASK_SIZE
Commit f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB") increased
the task size on book3s, and introduced a mechanism to dynamically
control whether a task uses these larger addresses.  While the change to
the task size itself was ifdef-protected to only apply on book3s, the
change to STACK_TOP_USER64 was not.  On book3e, this had the effect of
trying to use addresses up to 128TiB for the stack despite a 64TiB task
size limit -- which broke 64-bit userspace producing the following errors:

Starting init: /sbin/init exists but couldn't execute it (error -14)
Starting init: /bin/sh exists but couldn't execute it (error -14)
Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.

Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Scott Wood <oss@buserror.net>
2017-05-05 01:22:06 -05:00
Gavin Shan
c374ed27c9 powerpc/powernv: Block PCI config access on BCM5718 during EEH recovery
Similar to what is done in commit b6541db13952 ("powerpc/eeh: Block
PCI config access upon frozen PE"), we need block PCI config access
for BCM5719 when recovering frozen error on them. Otherwise, an
unexpected recursive fenced PHB error is observed.

   0001:06:00.0 Ethernet controller: Broadcom Corporation \
                NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
   0001:06:00.1 Ethernet controller: Broadcom Corporation \
                NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-04 10:57:56 +10:00
Nicholas Piggin
700b7eadd5 powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
Power9/ISAv3 has no VRMASD field in LPCR, we shouldn't be setting reserved bits,
so don't set them on Power9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 20:45:55 +10:00
Alistair Popple
6b3d12a948 powerpc/powernv: Fix TCE kill on NVLink2
Commit 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on
NVLink2") forced all TCE kills to go via the OPAL call for
NVLink2. However the PHB3 implementation of TCE kill was still being
called directly from some functions which in some circumstances caused
a machine check.

This patch adds an equivalent IODA2 version of the function which uses
the correct invalidation method depending on PHB model and changes all
external callers to use it instead.

Fixes: 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 20:45:55 +10:00
Michael Ellerman
3c9ac2bcc3 powerpc/mm/radix: Drop support for CPUs without lockless tlbie
Currently the radix TLB code includes support for CPUs that do *not*
have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a
global spinlock before issuing a tlbie.

Radix can only be built for 64-bit Book3s CPUs, and of those, only
POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although
it's possible to build a kernel with Radix support that can also boot on
those CPUs, we happen to know that in reality none of those CPUs support
the Radix MMU, so the code can never actually run on those CPUs.

So remove the native_tlbie_lock in the Radix TLB code.

Note that there is another lock of the same name in the hash code, which
is unaffected by this patch.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 20:45:45 +10:00
Mahesh Salgaonkar
d93b0ac01a powerpc/book3s/mce: Move add_taint() later in virtual mode
machine_check_early() gets called in real mode. The very first time when
add_taint() is called, it prints a warning which ends up calling opal
call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
very first machine check while we are in opal we are doomed. OPAL_CALL
overwrites the PACASAVEDMSR in r13 and in this case when we are done with
MCE handling the original opal call will use this new MSR on it's way
back to opal_return. This usually leads to unexpected behaviour or the
kernel to panic. Instead move the add_taint() call later in the virtual
mode where it is safe to call.

This is broken with current FW level. We got lucky so far for not getting
very first MCE hit while in OPAL. But easily reproducible on Mambo.

Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors.")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 14:45:39 +10:00
Michael Ellerman
3f2290e1b5 powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
The entire body of unregister_cpu_online() is inside an #ifdef
CONFIG_HOTPLUG_CPU block. This is ugly and means we create an empty function
when hotplug is disabled for no reason.

Instead move the #ifdef out of the function body and define the function to be
NULL in the else case. This means we'll pass NULL to cpuhp_setup_state(), but
that's fine because it accepts NULL to mean there is no teardown callback, which
is exactly what we want.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 14:45:38 +10:00
Michael Ellerman
687b8f24f1 powerpc/smp: Document irq enable/disable after migrating IRQs
This code was until recently completely undocumented and even now the comment is
not very verbose.

We've already had one patch sent to remove the IRQ enable/disable because it's
"paradoxical and unnecessary". So document it thoroughly to save anyone else
from puzzling over it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 14:45:38 +10:00
Michael Ellerman
0cc68bfad4 powerpc/mpc52xx: Don't select user-visible RTAS_PROC
Otherwise we might select it when its dependenices aren't enabled,
leading to a build break.

It's default y anyway, so will be on unless someone disables it
manually.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 14:45:37 +10:00
Andrew Donnellan
b7da123053 powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
pnv_eeh_reset() has special handling for PEs whose primary bus is the
root bus or the bus immediately underneath the root port.

The cxl bi-modal card support added in b0b5e5918ad1 ("cxl: Add
cxl_check_and_switch_mode() API to switch bi-modal cards") relies on
this behaviour when hot-resetting the CAPI adapter following a mode
switch.  Document this in pnv_eeh_reset() so we don't accidentally break
it.

Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03 14:45:37 +10:00
Christophe Leroy
726bd22310 powerpc/8xx: Adding support of IRQ in MPC8xx GPIO
This patch allows the use of IRQ to notify the change of GPIO status
on MPC8xx CPM IO ports. This then allows to associate IRQs to GPIOs
in the Device Tree.

Ex:
	CPM1_PIO_C: gpio-controller@960 {
		#gpio-cells = <2>;
		compatible = "fsl,cpm1-pario-bank-c";
		reg = <0x960 0x10>;
		fsl,cpm1-gpio-irq-mask = <0x0fff>;
		interrupts = <1 2 6 9 10 11 14 15 23 24 26 31>;
		interrupt-parent = <&CPM_PIC>;
		gpio-controller;
	};

The property 'fsl,cpm1-gpio-irq-mask' defines which of the 16 GPIOs
have the associated interrupts defined in the 'interrupts' property.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2017-05-02 22:35:00 -05:00
Russell Currey
c0b64978f0 powerpc/eeh: Clean up and document event handling functions
Remove unnecessary tags in eeh_handle_normal_event(), and add function
comments for eeh_handle_normal_event() and eeh_handle_special_event().

The only functional difference is that in the case of a PE reaching the
maximum number of failures, rather than one message telling you of this
and suggesting you reseat the device, there are two separate messages.

Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02 22:41:43 +10:00
Russell Currey
daeba2956f powerpc/eeh: Avoid use after free in eeh_handle_special_event()
eeh_handle_special_event() is called when an EEH event is detected but
can't be narrowed down to a specific PE.  This function looks through
every PE to find one in an erroneous state, then calls the regular event
handler eeh_handle_normal_event() once it knows which PE has an error.

However, if eeh_handle_normal_event() found that the PE cannot possibly
be recovered, it will free it, rendering the passed PE stale.
This leads to a use after free in eeh_handle_special_event() as it attempts to
clear the "recovering" state on the PE after eeh_handle_normal_event() returns.

Thus, make sure the PE is valid when attempting to clear state in
eeh_handle_special_event().

Fixes: 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle special event")
Cc: stable@vger.kernel.org # v3.11+
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02 22:41:43 +10:00
Nicholas Piggin
084a275e4c powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
This was a hack we added to work around the allmodconfig build breaking, see
commit fb43e8477ed9 ("powerpc: Disable RELOCATABLE for COMPILE_TEST with
PPC64").

Since we merged the thin archives support in commit 43c9127d94d6 ("powerpc: Add
option to use thin archives") this hasn't been necessary, so remove it.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-01 18:54:01 +10:00
Michael Neuling
8915bcd68b powerpc/xmon: Teach xmon oops about radix vectors
Currently if we take an oops caused by an 0x380 or 0x480 exception, we get a
print which assumes SLB problems. With radix, these vectors have different
meanings.

This patch updates the oops message to reflect these different meanings.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-01 18:52:58 +10:00
LiuHailong
fd615f69a1 powerpc/64e: Fix hang when debugging programs with relocated kernel
Debug interrupts can be taken during interrupt entry, since interrupt
entry does not automatically turn them off.  The kernel will check
whether the faulting instruction is between [interrupt_base_book3e,
__end_interrupts], and if so clear MSR[DE] and return.

However, when the kernel is built with CONFIG_RELOCATABLE, it can't use
LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e) and
LOAD_REG_IMMEDIATE(r15,__end_interrupts), as they ignore relocation.
Thus, if the kernel is actually running at a different address than it
was built at, the address comparison will fail, and the exception entry
code will hang at kernel_dbg_exc.

r2(toc) is also not usable here, as r2 still holds data from the
interrupted context, so LOAD_REG_ADDR() doesn't work either.  So we use
the *name@got* to get the EV of two labels directly.

Test programs test.c shows as follows:
int main(int argc, char *argv[])
{
	if (access("/proc/sys/kernel/perf_event_paranoid", F_OK) == -1)
		printf("Kernel doesn't have perf_event support\n");
}

Steps to reproduce the bug, for example:
 1) ./gdb ./test
 2) (gdb) b access
 3) (gdb) r
 4) (gdb) s

Signed-off-by: Liu Hailong <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Reviewed-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Huang Jian <huang.jian@zte.com.cn>
[scottwood: cleaned up commit message, and specified bad behavior
 as a hang rather than an oops to correspond to mainline kernel behavior]
Fixes: 1cb6e0649248 ("powerpc/book3e: support CONFIG_RELOCATABLE")
Cc: <stable@vger.kernel.org> # 4.4.x-
Signed-off-by: Scott Wood <oss@buserror.net>
2017-04-30 01:05:18 -05:00
Michael Ellerman
add2e1e585 powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
Michal Suchánek noticed a comment in book3s/64/mmu-hash.h about the context ids
we use for the kernel was inconsistent with the code and other comments in the
same file.

It should read 1-4 not 1-5.

While we're touching it, update "address" to "addresses" which makes more sense
as it's referring to more than one address below.

Reported-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 22:02:55 +10:00
Alexey Kardashevskiy
b6e1f6adce powerpc/pseries: Enable VFIO
This enables VFIO on pseries host in order to allow VFIO in nested guest under
PR KVM or DPDK in a HV guest. This adds support of the VFIO_SPAPR_TCE_IOMMU
type.

This adds exchange() callback to allow TCE updates by the SPAPR TCE IOMMU
driver in VFIO.

This initializes DMA32 window parameters in iommu_table_group as as this does
not implement VFIO_SPAPR_TCE_v2_IOMMU and VFIO_SPAPR_TCE_IOMMU just reuses the
existing DMA32 window.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:54 +10:00
Alexey Kardashevskiy
e49a6a2173 powerpc/powernv: Fix iommu table size calculation hook for small tables
When the userspace requests a small TCE table (which takes less than
the system page size) and more than 1 TCE level, the existing code
returns a single page size which is a bug as each additional TCE level
requires at least one page and this is what
pnv_pci_ioda2_table_alloc_pages() does. And we end up seeing
WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size))
in drivers/vfio/vfio_iommu_spapr_tce.c.

This replaces incorrect _ALIGN_UP() (which aligns zero up to zero) with
max_t() to fix the bug.

Besides removing WARN_ON(), there should be no other changes in
behaviour.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:54 +10:00
Alexey Kardashevskiy
82eae1afbb powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
pnv_pci_table_alloc() ignores possible failure from kzalloc_node(),
this adds a check. There are 2 callers of pnv_pci_table_alloc(),
one already checks for tbl!=NULL, this adds WARN_ON() to the other path
which only happens during boot time in IODA1 and not expected to fail.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:53 +10:00
Nicholas Piggin
b71c9ffb14 powerpc: Add arch/powerpc/tools directory
Move a couple of existing scripts under there. Remove scripts directory:
a script is a tool, a tool is not a script.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:53 +10:00
Nicholas Piggin
f188d0524d powerpc: Use the new post-link pass to check relocations
Currently powerpc has to introduce a dependency on its default build
target zImage in order to run a relocation check pass over the linked
vmlinux. This is deficient because the check is not run if the plain
vmlinux target is built, or if one of the other boot targets is built.

Switch to using the kbuild post-link pass, added in commit fbe6e37dab97
("kbuild: add arch specific post-link Makefile") in order to run this
check. In future powerpc will use this to do more complicated operations,
but initially using it for something simple is a good first step.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:52 +10:00
Nicholas Piggin
1cd6ed7c2e powerpc/xmon: Wait for secondaries before IPI'ing on system reset
An externally triggered system reset (e.g., via QEMU nmi command, or pseries
reset button) can cause system reset interrupts on all CPUs. In case this causes
xmon to be entered, it is undesirable for the primary (first) CPU into xmon to
trigger an NMI IPI to others, because this may cause a nested system reset
interrupt.

So spin for a time waiting for secondaries to join xmon before performing the
NMI IPI, similarly to what the crash dump code does.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Only do it when we come in from system reset, not via sysrq etc.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:26:46 +10:00
Nicholas Piggin
102c05e8dc powerpc/pseries: Implement NMI IPI with H_SIGNAL_SYS_RESET
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
c64af6458e powerpc: Add struct smp_ops_t.cause_nmi_ipi operation
Have the NMI IPI code use this op when the platform defines it.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
ddd703ca06 powerpc: Add NMI IPI infrastructure
Add a simple NMI IPI system that handles concurrency and reentrancy.

The platform does not have to implement a true non-maskable interrupt,
the default is to simply use the debugger break IPI message. This has
now been co-opted for a general IPI message, and users (debugger and
crash) have been reimplemented on top of the NMI system.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate incremental fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
2b4f3ac564 powerpc: Mark system reset as an NMI with nmi_enter/exit()
System reset is a non-maskable interrupt from Linux's point of view
(occurs under local_irq_disable()), so it should use nmi_enter/exit.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
b1ee8a3de5 powerpc/64s: Dedicated system reset interrupt stack
The system reset interrupt is used for crash/debug situations, so it is
desirable to have as little impact on the normal state of the system as
possible.

Currently it uses the current kernel stack to process the exception.
This stores into the stack which may be involved with the crash. The
stack pointer may be corrupted, or it may have overflowed.

Avoid or minimise these problems by creating a dedicated NMI stack for
the system reset interrupt to use.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
c4f3b52ce7 powerpc/64s: Disallow system reset vs system reset reentrancy
In preparation for using a dedicated stack for system reset interrupts,
prevent a nested system reset from recovering, in order to simplify
code that is called in crash/debug path. This allows a system reset
interrupt to just use the base stack pointer.

Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
the interrrupt non-recoverable if it is taken inside another system
reset.

Interrupt nesting could be allowed similarly to MCE, but system reset
is a special case that's not for normal operation, so simplicity wins
until there is requirement for nested system reset interrupts.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
a3d96f70c1 powerpc/64s: Fix system reset vs general interrupt reentrancy
The system reset interrupt can occur when MSR_EE=0, and it currently
uses the PACA_EXGEN save area.

Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
when the save area is still in use. A system reset interrupt in this
window can lead to undetected corruption when the save area gets
overwritten.

This patch introduces PACA_EXNMI save area for system reset exceptions,
which closes this corruption window. It's also helpful to retain the
EXGEN state for debugging situations, even if not considering the
recoverability aspect.

This patch also moves the PACA_EXMC area down to a less frequently used
part of the paca with the new save area.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
a4087a4d38 powerpc/64s: Exception macro for stack frame and initial register save
This code is common to a few exceptions, and another user will be added.
This causes a trivial change to generated code:

-     604: std     r9,416(r1)
-     608: mfspr   r11,314
-     60c: std     r11,368(r1)
-     610: mfspr   r12,315
+     604: mfspr   r11,314
+     608: mfspr   r12,315
+     60c: std     r9,416(r1)
+     610: std     r11,368(r1)

machine_check_powernv_early could also use this, but that requires non
trivial changes to generated code, so that's for another patch.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
83a980f7f4 powerpc/64s: Add exception macro that does not enable RI
Subsequent patches will add more non-RI variant exceptions, so
create a macro for it rather than open-code it.

This does not change generated instructions.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
6e83985b0f powerpc/cbe: Do not process external or decremeter interrupts from sreset
Cell will wake from low power state at the system reset interrupt,
with the event encoded in SRR1, rather than waking at the interrupt
vector that corresponds to that event.

The system reset handler for this platform decodes SRR1 event reason
and calls the interrupt handler to process it directly from the system
reset handlre.

A subsequent change will treat the system reset interrupt as a Linux NMI
with its own per-CPU stack, and this will no longer work. Remove the
external and decrementer handlers from the system reset handler.

- The external exception remains raised and will fire again at the
  EE interrupt vector when system reset returns.

- The decrementer is set to 1 so it will be raised again and fire when
  the system reset returns.

It is possible to branch to an idle handler from the system reset
interrupt (like POWER does), then restore a normal stack and restore
this optimisation. But simplicity wins for now.

Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Nicholas Piggin
461e96a337 powerpc/pasemi: Do not process external or decrementer interrupts from sreset
PA Semi will wake from low power state at the system reset interrupt,
with the event encoded in SRR1, rather than waking at the interrupt
vector that corresponds to that event.

The system reset handler for this platform decodes SRR1 event reason
and calls the interrupt handler to process it directly from the system
reset handlre.

A subsequent change will treat the system reset interrupt as a Linux NMI
with its own per-CPU stack, and this will no longer work. Remove the
external and decrementer handlers from the system reset handler.

- The external exception remains raised and will fire again at the
  EE interrupt vector when system reset returns.

- The decrementer is set to 1 so it will be raised again and fire when
  the system reset returns.

It is possible to branch to an idle handler from the system reset
interrupt (like POWER does), then restore a normal stack and restore
this optimisation. But simplicity wins for now.

Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28 21:02:25 +10:00
Michael Ellerman
b13f6683ed Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we were sharing with kvm-ppc, Paul has also
merged it.
2017-04-28 20:19:37 +10:00
Naveen N. Rao
096ff2ddba powerpc/ftrace/64: Split further based on -mprofile-kernel
Split ftrace_64.S further retaining the core ftrace 64-bit aspects
in ftrace_64.S and moving ftrace_caller() and ftrace_graph_caller() into
separate files based on -mprofile-kernel. The livepatch routines are all
now contained within the mprofile file.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:29 +10:00
Naveen N. Rao
7853f9c029 powerpc: Split ftrace bits into a separate file
entry_*.S now includes a lot more than just kernel entry/exit code. As a
first step at cleaning this up, let's split out the ftrace bits into
separate files. Also move all related tracing code into a new trace/
subdirectory.

No functional changes.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:29 +10:00
Christophe Leroy
2505820f7c powerpc/mm: Rename table dump file name
Page table dump debugfs file is named 'kernel_page_tables' on
all other architectures implementing it, while is is named
'kernel_pagetables' on powerpc. This patch renames it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:28 +10:00
Christophe Leroy
78a18dbf01 powerpc/mm: On PPC32, display 32 bits addresses in page table dump
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:28 +10:00
Christophe Leroy
fd893fe56a powerpc/mm: Fix missing page attributes in page table dump
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used.
There is also _PAGE_SHARED that is missing.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:27 +10:00
Christophe Leroy
6c01bbd2cf powerpc/mm: Fix page table dump build on PPC32
On PPC32 (eg. mpc885_ads_defconfig), page table dump compilation fails as
follows. This is because the memory layout is slightly different on PPC32. This
patch adapts it.

  arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables':
  arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' undeclared (first use in this function)
  ...

Fixes: 8eb07b187000d ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:26 +10:00
Aneesh Kumar K.V
a5998fcb92 powerpc/mm/radix: Optimise tlbiel flush all case
_tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of
either RIC_FLUSH_TLB or RIC_FLUSH_ALL.

RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC).

To flush the whole TLB, we have to iterate over each set (congruence class) of
the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not
incorrect but it means we flush the PWC 128 times, when once would suffice.

Fix it by doing the first flush with the ric value we're passed, and then if it
was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have
just flushed the PWC and don't need to do it again.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, tweak logic, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:21 +10:00
Aneesh Kumar K.V
cf4f08bed8 powerpc/mm/radix: Optimise Page Walk Cache flush
Currently we implement flushing of the page walk cache (PWC) by calling
_tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to
only flush the PWC.

But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is
not necessary when we're just flushing the PWC.

In fact the set argument is ignored for a PWC flush, so essentially we're just
flushing the PWC 127 extra times for no benefit.

Fix it by adding tlbiel_pwc() which just does a single flush of the PWC.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, drop _ in name, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 22:20:05 +10:00
Michael Ellerman
45b21cfeb2 powerpc/powernv: Fix oops on P9 DD1 in cause_ipi()
Recently we merged the native xive support for Power9, and then separately some
reworks for doorbell IPI support. In isolation both series were OK, but the
merged result had a bug in one case.

On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
falls back to the interrupt controller. However the fallback is implemented by
calling icp_ops->cause_ipi. But now that xive support is merged we might be
using xive, in which case icp_ops is not initialised, it's a xics specific
structure. This leads to an oops such as:

  Unable to handle kernel paging request for data at address 0x00000028
  Oops: Kernel access of bad area, sig: 11 [#1]
  NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
  LR smp_muxed_ipi_message_pass+0x54/0x70

To fix it, rather than using icp_ops which might be NULL, have both xics and
xive set smp_ops->cause_ipi, and then in the powernv code we save that as
ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON()
to check if somehow smp_ops->cause_ipi is NULL.

Fixes: b866cc2199d6 ("powerpc: Change the doorbell IPI calling convention")
Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26 23:28:12 +10:00
Michael Ellerman
83c4919058 powerpc/powernv: Fix missing attr initialisation in opal_export_attrs()
In opal_export_attrs() we dynamically allocate some bin_attributes. They're
allocated with kmalloc() and although we initialise most of the fields, we don't
initialise write() or mmap(), and in particular we don't initialise the lockdep
related fields in the embedded struct attribute.

This leads to a lockdep warning at boot:

  BUG: key c0000000f11906d8 not in .data!
  WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0
  ...
  Call Trace:
    lockdep_init_map+0x288/0x2a0 (unreliable)
    __kernfs_create_file+0x8c/0x170
    sysfs_add_file_mode_ns+0xc8/0x240
    __machine_initcall_powernv_opal_init+0x60c/0x684
    do_one_initcall+0x60/0x1c0
    kernel_init_freeable+0x2f4/0x3d4
    kernel_init+0x24/0x160
    ret_from_kernel_thread+0x5c/0xb0

Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
fields.

Fixes: 11fe909d2362 ("powerpc/powernv: Add OPAL exports attributes to sysfs")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26 15:57:19 +10:00