24309 Commits

Author SHA1 Message Date
Nicholas Piggin
4a5cb51f3d powerpc/64s/interrupt: Fix check_return_regs_valid() false positive
The check_return_regs_valid() can cause a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit in the bottom of regs->trap causes interrupt type
matching to fail.

This can occcur for example on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.

Fix it by using the TRAP() accessor which masks out the low bit.

Fixes: 6eaaf9de3599 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com
2021-10-27 22:33:47 +11:00
Christophe Leroy
b949d009dd powerpc/boot: Set LC_ALL=C in wrapper script
While trying to build a simple Image for ACADIA platform, I got the
following error:

	  WRAP    arch/powerpc/boot/simpleImage.acadia
	INFO: Uncompressed kernel (size 0x6ae7d0) overlaps the address of the wrapper(0x400000)
	INFO: Fixing the link_address of wrapper to (0x700000)
	powerpc64-linux-gnu-ld : mode d'émulation non reconnu : -T
	Émulations prises en charge : elf64ppc elf32ppc elf32ppclinux elf32ppcsim elf64lppc elf32lppc elf32lppclinux elf32lppcsim
	make[1]: *** [arch/powerpc/boot/Makefile:424 : arch/powerpc/boot/simpleImage.acadia] Erreur 1
	make: *** [arch/powerpc/Makefile:285 : simpleImage.acadia] Erreur 2

Trying again with V=1 shows the following command

	powerpc64-linux-gnu-ld -m -T arch/powerpc/boot/zImage.lds -Ttext 0x700000 --no-dynamic-linker -o arch/powerpc/boot/simpleImage.acadia -Map wrapper.map arch/powerpc/boot/fixed-head.o arch/powerpc/boot/simpleboot.o ./zImage.3278022.o arch/powerpc/boot/wrapper.a

The argument of '-m' is missing.

This is due to the wrapper script calling 'objdump -p vmlinux' and
looking for 'file format', whereas the output of objdump is:

	vmlinux:     format de fichier elf32-powerpc

	En-tête de programme:
	    LOAD off    0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
	         filesz 0x0069e1d4 memsz 0x006c128c flags rwx
	    NOTE off    0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
	         filesz 0x00000054 memsz 0x00000054 flags ---

Add LC_ALL=C at the beginning of the wrapper script in order to get the
output expected by the script:

	vmlinux:     file format elf32-powerpc

	Program Header:
	    LOAD off    0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
	         filesz 0x0069e1d4 memsz 0x006c128c flags rwx
	    NOTE off    0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
	         filesz 0x00000054 memsz 0x00000054 flags ---

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a9ff3bc98035f63b122c051f02dc47c7aed10430.1635256089.git.christophe.leroy@csgroup.eu
2021-10-27 22:31:22 +11:00
Joel Stanley
f22969a660 powerpc/64s: Default to 64K pages for 64 bit book3s
For 64-bit book3s the default should be 64K as that's what modern CPUs
are designed for.

The following defconfigs already set CONFIG_PPC_64K_PAGES:

 cell_defconfig
 pasemi_defconfig
 powernv_defconfig
 ppc64_defconfig
 pseries_defconfig
 skiroot_defconfig

The have the option removed from the defconfig, as it is now the
default.

The defconfigs that now need to set CONFIG_PPC_4K_PAGES to maintain
their existing behaviour are:

 g5_defconfig
 maple_defconfig
 microwatt_defconfig
 ps3_defconfig

Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BugLink: https://github.com/linuxppc/issues/issues/109
Link: https://lore.kernel.org/r/20211015001649.45591-1-joel@jms.id.au
2021-10-27 22:31:22 +11:00
Michael Ellerman
b7472e1764 Revert "powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC"
This reverts commit 566af8cda399c088763d07464463dc871c943b54.

This caused some conflicts vs the audit tree, and the audit maintainers
would prefer we postpone this to the next merge window so we have more
time for testing.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2021-10-27 22:30:32 +11:00
Eric W. Biederman
83a1f27ad7 signal/powerpc: On swapcontext failure force SIGSEGV
If the register state may be partial and corrupted instead of calling
do_exit, call force_sigsegv(SIGSEGV).  Which properly kills the
process with SIGSEGV and does not let any more userspace code execute,
instead of just killing one thread of the process and potentially
confusing everything.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.")
Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt")
Link: https://lkml.kernel.org/r/20211020174406.17889-7-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-25 15:56:29 -05:00
Alexey Kardashevskiy
d853adc7ad powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present
The iommu_init_table() helper takes an address range to reserve in
the IOMMU table being initialized to exclude MMIO addresses, this is
useful if the window stretches far beyond 4GB (although wastes some TCEs).
At the moment the code searches for such MMIO32 range and fails if none
found which is considered a problem while it really is not: it is actually
better as this says there is no MMIO32 to reserve and we can use
usually wasted TCEs. Furthermore PHYP never actually allows creating
windows starting at busaddress=0 so this MMIO32 range is never useful.

This removes error exit and initializes the table with zero range if
no MMIO32 is detected.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020132315.2287178-5-aik@ozlabs.ru
2021-10-25 11:41:15 +11:00
Alexey Kardashevskiy
92fe01b7c6 powerpc/pseries/iommu: Check if the default window in use before removing it
At the moment this check is performed after we remove the default window
which is late and disallows to revert whatever changes enable_ddw()
has made to DMA windows.

This moves the check and error exit before removing the window.

This raised the message severity from "debug" to "warning" as this
should not happen in practice and cannot be triggered by the userspace.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020132315.2287178-4-aik@ozlabs.ru
2021-10-25 11:41:14 +11:00
Alexey Kardashevskiy
41ee7232fa powerpc/pseries/iommu: Use correct vfree for it_map
The it_map array is vzalloc'ed so use vfree() for it when creating
a huge DMA window failed for whatever reason.

While at this, write zero to it_map.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020132315.2287178-3-aik@ozlabs.ru
2021-10-25 11:41:14 +11:00
Masahiro Yamada
8212f8986d kbuild: use more subdir- for visiting subdirectories while cleaning
Documentation/kbuild/makefiles.rst suggests to use "archclean" for
cleaning arch/$(SRCARCH)/boot/, but it is not a hard requirement.

Since commit d92cc4d51643 ("kbuild: require all architectures to have
arch/$(SRCARCH)/Kbuild"), we can use the "subdir- += boot" trick for
all architectures. This can take advantage of the parallel option (-j)
for "make clean".

I also cleaned up the comments in arch/$(SRCARCH)/Makefile. The "archdep"
target no longer exists.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
2021-10-24 13:49:46 +09:00
Nathan Lynch
319fa1a52e powerpc/pseries/mobility: ignore ibm, platform-facilities updates
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:

  $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
  /sys/firmware/devicetree/base/ibm,platform-facilities/
  ├── ibm,compression-v1
  ├── ibm,random-v1
  └── ibm,sym-encryption-v1

  3 directories

The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.

But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):

  mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
  mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
  mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
  mobility: removing node /ibm,platform-facilities:4294967286
  ...
  mobility: added node /ibm,platform-facilities:4294967286

Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.

Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.

We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:

  refcount_t: addition on 0; use-after-free.
  WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
  refcount_warn_saturate+0x160/0x1f0 (unreliable)
  kobject_get+0xf0/0x100
  of_node_get+0x30/0x50
  of_get_parent+0x50/0xb0
  of_fwnode_get_parent+0x54/0x90
  fwnode_count_parents+0x50/0x150
  fwnode_full_name_string+0x30/0x110
  device_node_string+0x49c/0x790
  vsnprintf+0x1c0/0x4c0
  sprintf+0x44/0x60
  devspec_show+0x34/0x50
  dev_attr_show+0x40/0xa0
  sysfs_kf_seq_show+0xbc/0x200
  kernfs_seq_show+0x44/0x60
  seq_read_iter+0x2a4/0x740
  kernfs_fop_read_iter+0x254/0x2e0
  new_sync_read+0x120/0x190
  vfs_read+0x1d0/0x240

Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:

  $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
  /sys/firmware/devicetree/base/ibm,platform-facilities/

  0 directories

  $ cd /sys/devices/vio ; find . -xtype l -exec file {} +
  ./ibm,sym-encryption-v1/of_node: broken symbolic link to
    ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
  ./ibm,random-v1/of_node:         broken symbolic link to
    ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
  ./ibm,compression-v1/of_node:    broken symbolic link to
    ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1

This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.

One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.

The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.

Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.

Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable@vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
2021-10-22 15:22:06 +11:00
Christophe Leroy
c7d19189d7 powerpc/32: Don't use a struct based type for pte_t
Long time ago we had a config item called STRICT_MM_TYPECHECKS
to build the kernel with pte_t defined as a structure in order
to perform additional build checks or build it with pte_t
defined as a simple type in order to get simpler generated code.

Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
made the struct based definition the only one, considering that the
generated code was similar in both cases.

That's right on ppc64 because the ABI is such that the content of a
struct having a single simple type element is passed as register,
but on ppc32 such a structure is passed via the stack like any
structure.

Simple test function:

	pte_t test(pte_t pte)
	{
		return pte;
	}

Before this patch we get

	c00108ec <test>:
	c00108ec:	81 24 00 00 	lwz     r9,0(r4)
	c00108f0:	91 23 00 00 	stw     r9,0(r3)
	c00108f4:	4e 80 00 20 	blr

So, for PPC32, restore the simple type behaviour we got before
commit 670eea924198, but instead of adding a config option to
activate type check, do it when __CHECKER__ is set so that type
checking is performed by 'sparse' and provides feedback like:

	arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types)
	arch/powerpc/mm/pgtable.c:466:16:    expected unsigned long
	arch/powerpc/mm/pgtable.c:466:16:    got struct pte_t [usertype] x

With this patch we now get

	c0010890 <test>:
	c0010890:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Define STRICT_MM_TYPECHECKS rather than repeating the condition]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c904599f33aaf6bb7ee2836a9ff8368509e0d78d.1631887042.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:06 +11:00
Christophe Leroy
a61ec782a7 powerpc/breakpoint: Cleanup
cache_op_size() does exactly the same as l1_dcache_bytes().

Remove it.

MSR_64BIT already exists, no need to enclode the check
around #ifdef __powerpc64__

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6184b08088312a7d787d450eb902584e4ae77f7a.1632317816.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:06 +11:00
Christophe Leroy
fdacae8a84 powerpc: Activate CONFIG_STRICT_KERNEL_RWX by default
CONFIG_STRICT_KERNEL_RWX should be set by default on every
architectures (See https://github.com/KSPP/linux/issues/4)

On PPC32 we have to find a compromise between performance and/or
memory wasting and selection of strict_kernel_rwx, because it implies
either smaller memory chunks or larger alignment between RO memory
and RW memory.

For instance the 8xx maps memory with 8M pages. So either the limit
between RO and RW must be 8M aligned or it falls back or 512k pages
which implies more pressure on the TLB.

book3s/32 maps memory with BATs as much as possible. BATS can have
any power-of-two size between 128k and 256M but we have only 4 to 8
BATs so the alignment must be good enough to allow efficient use of
the BATs and avoid falling back on standard page mapping which would
kill performance.

So let's go one step forward and make it the default but still allow
users to unset it when wanted.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/057c40164084bfc7d77c0b2ff78d95dbf6a2a21b.1632503622.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
63f501e07a powerpc/8xx: Simplify TLB handling
In the old days, TLB handling for 8xx was using tlbie and tlbia
instructions directly as much as possible.

But commit f048aace29e0 ("powerpc/mm: Add SMP support to no-hash
TLB handling") broke that by introducing out-of-line unnecessary
complex functions for booke/smp which don't have tlbie/tlbia
instructions and require more complex handling.

Restore direct use of tlbie and tlbia for 8xx which is never SMP.

With this patch we now get

	c00ecc68 <ptep_clear_flush>:
	c00ecc68:	39 00 00 00 	li      r8,0
	c00ecc6c:	81 46 00 00 	lwz     r10,0(r6)
	c00ecc70:	91 06 00 00 	stw     r8,0(r6)
	c00ecc74:	7c 00 2a 64 	tlbie   r5,r0
	c00ecc78:	7c 00 04 ac 	hwsync
	c00ecc7c:	91 43 00 00 	stw     r10,0(r3)
	c00ecc80:	4e 80 00 20 	blr

Before it was

	c0012880 <local_flush_tlb_page>:
	c0012880:	2c 03 00 00 	cmpwi   r3,0
	c0012884:	41 82 00 54 	beq     c00128d8 <local_flush_tlb_page+0x58>
	c0012888:	81 22 00 00 	lwz     r9,0(r2)
	c001288c:	81 43 00 20 	lwz     r10,32(r3)
	c0012890:	39 29 00 01 	addi    r9,r9,1
	c0012894:	91 22 00 00 	stw     r9,0(r2)
	c0012898:	2c 0a 00 00 	cmpwi   r10,0
	c001289c:	41 82 00 10 	beq     c00128ac <local_flush_tlb_page+0x2c>
	c00128a0:	81 2a 01 dc 	lwz     r9,476(r10)
	c00128a4:	2c 09 ff ff 	cmpwi   r9,-1
	c00128a8:	41 82 00 0c 	beq     c00128b4 <local_flush_tlb_page+0x34>
	c00128ac:	7c 00 22 64 	tlbie   r4,r0
	c00128b0:	7c 00 04 ac 	hwsync
	c00128b4:	81 22 00 00 	lwz     r9,0(r2)
	c00128b8:	39 29 ff ff 	addi    r9,r9,-1
	c00128bc:	2c 09 00 00 	cmpwi   r9,0
	c00128c0:	91 22 00 00 	stw     r9,0(r2)
	c00128c4:	4c a2 00 20 	bclr+   4,eq
	c00128c8:	81 22 00 70 	lwz     r9,112(r2)
	c00128cc:	71 29 00 04 	andi.   r9,r9,4
	c00128d0:	4d 82 00 20 	beqlr
	c00128d4:	48 65 76 74 	b       c0669f48 <preempt_schedule>
	c00128d8:	81 22 00 00 	lwz     r9,0(r2)
	c00128dc:	39 29 00 01 	addi    r9,r9,1
	c00128e0:	91 22 00 00 	stw     r9,0(r2)
	c00128e4:	4b ff ff c8 	b       c00128ac <local_flush_tlb_page+0x2c>
...
	c00ecdc8 <ptep_clear_flush>:
	c00ecdc8:	94 21 ff f0 	stwu    r1,-16(r1)
	c00ecdcc:	39 20 00 00 	li      r9,0
	c00ecdd0:	93 c1 00 08 	stw     r30,8(r1)
	c00ecdd4:	83 c6 00 00 	lwz     r30,0(r6)
	c00ecdd8:	91 26 00 00 	stw     r9,0(r6)
	c00ecddc:	93 e1 00 0c 	stw     r31,12(r1)
	c00ecde0:	7c 08 02 a6 	mflr    r0
	c00ecde4:	7c 7f 1b 78 	mr      r31,r3
	c00ecde8:	7c 83 23 78 	mr      r3,r4
	c00ecdec:	7c a4 2b 78 	mr      r4,r5
	c00ecdf0:	90 01 00 14 	stw     r0,20(r1)
	c00ecdf4:	4b f2 5a 8d 	bl      c0012880 <local_flush_tlb_page>
	c00ecdf8:	93 df 00 00 	stw     r30,0(r31)
	c00ecdfc:	7f e3 fb 78 	mr      r3,r31
	c00ece00:	80 01 00 14 	lwz     r0,20(r1)
	c00ece04:	83 c1 00 08 	lwz     r30,8(r1)
	c00ece08:	83 e1 00 0c 	lwz     r31,12(r1)
	c00ece0c:	7c 08 03 a6 	mtlr    r0
	c00ece10:	38 21 00 10 	addi    r1,r1,16
	c00ece14:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/fb324f1c8f2ddb57cf6aad1cea26329558f1c1c0.1631887021.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
e28d0b6750 powerpc/lib/sstep: Don't use __{get/put}_user() on kernel addresses
In the old days, when we didn't have kernel userspace access
protection and had set_fs(), it was wise to use __get_user()
and friends to read kernel memory.

Nowadays, get_user() and put_user() are granting userspace access and
are exclusively for userspace access.

Convert single step emulation functions to user_access_begin() and
friends and use unsafe_get_user() and unsafe_put_user().

When addressing kernel addresses, there is no need to open userspace
access. And for book3s/32 it is particularly important to no try and
open userspace access on kernel address, because that would break the
content of kernel space segment registers. No guard has been put
against that risk in order to avoid degrading performance.

copy_from_kernel_nofault() and copy_to_kernel_nofault() should
be used but they are out-of-line functions which would degrade
performance. Those two functions are making use of
__get_kernel_nofault() and __put_kernel_nofault() macros.
Those two macros are just wrappers behind __get_user_size_goto() and
__put_user_size_goto().

unsafe_get_user() and unsafe_put_user() are also wrappers of
__get_user_size_goto() and __put_user_size_goto(). Use them to
access kernel space. That allows refactoring userspace and
kernelspace access.

Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Depends-on: 4fe5cda9f89d ("powerpc/uaccess: Implement user_read_access_begin and user_write_access_begin")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/22831c9d17f948680a12c5292e7627288b15f713.1631817805.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
cbe654c779 powerpc: warn on emulation of dcbz instruction in kernel mode
dcbz instruction shouldn't be used on non-cached memory. Using
it on non-cached memory can result in alignment exception and
implies a heavy handling.

Instead of silentely emulating the instruction and resulting in high
performance degradation, warn whenever an alignment exception is
taken in kernel mode due to dcbz, so that the user is made aware that
dcbz instruction has been used unexpectedly by the kernel.

Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e3acfe63d289c6fba366e16973c9ab8369e8b75.1631803922.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
5c810ced36 powerpc/32: Add support for out-of-line static calls
Add support for out-of-line static calls on PPC32. This change
improve performance of calls to global function pointers by
using direct calls instead of indirect calls.

The trampoline is initialy populated with a 'blr' or branch to target,
followed by an unreachable long jump sequence.

In order to cater with parallele execution, the trampoline needs to
be updated in a way that ensures it remains consistent at all time.
This means we can't use the traditional lis/addi to load r12 with
the target address, otherwise there would be a window during which
the first instruction contains the upper part of the new target
address while the second instruction still contains the lower part of
the old target address. To avoid that the target address is stored
just after the 'bctr' and loaded from there with a single instruction.

Then, depending on the target distance, arch_static_call_transform()
will either replace the first instruction by a direct 'bl <target>' or
'nop' in order to have the trampoline fall through the long jump
sequence.

For the special case of __static_call_return0(), to avoid the risk of
a far branch, a version of it is inlined at the end of the trampoline.

Performancewise the long jump sequence is probably not better than
the indirect calls set by GCC when we don't use static calls, but
such calls are unlikely to be required on powerpc32: With most
configurations the kernel size is far below 32 Mbytes so only
modules may happen to be too far. And even modules are likely to
be close enough as they are allocated below the kernel core and
as close as possible of the kernel text.

static_call selftest is running successfully with this change.

With this patch, __do_irq() has the following sequence to trace
irq entries:

	c0004a00 <__SCT__tp_func_irq_entry>:
	c0004a00:	48 00 00 e0 	b       c0004ae0 <__traceiter_irq_entry>
	c0004a04:	3d 80 c0 00 	lis     r12,-16384
	c0004a08:	81 8c 4a 1c 	lwz     r12,18972(r12)
	c0004a0c:	7d 89 03 a6 	mtctr   r12
	c0004a10:	4e 80 04 20 	bctr
	c0004a14:	38 60 00 00 	li      r3,0
	c0004a18:	4e 80 00 20 	blr
	c0004a1c:	00 00 00 00 	.long 0x0
...
	c0005654 <__do_irq>:
...
	c0005664:	7c 7f 1b 78 	mr      r31,r3
...
	c00056a0:	81 22 00 00 	lwz     r9,0(r2)
	c00056a4:	39 29 00 01 	addi    r9,r9,1
	c00056a8:	91 22 00 00 	stw     r9,0(r2)
	c00056ac:	3d 20 c0 af 	lis     r9,-16209
	c00056b0:	81 29 74 cc 	lwz     r9,29900(r9)
	c00056b4:	2c 09 00 00 	cmpwi   r9,0
	c00056b8:	41 82 00 10 	beq     c00056c8 <__do_irq+0x74>
	c00056bc:	80 69 00 04 	lwz     r3,4(r9)
	c00056c0:	7f e4 fb 78 	mr      r4,r31
	c00056c4:	4b ff f3 3d 	bl      c0004a00 <__SCT__tp_func_irq_entry>

Before this patch, __do_irq() was doing the following to trace irq
entries:

	c0005700 <__do_irq>:
...
	c0005710:	7c 7e 1b 78 	mr      r30,r3
...
	c000574c:	93 e1 00 0c 	stw     r31,12(r1)
	c0005750:	81 22 00 00 	lwz     r9,0(r2)
	c0005754:	39 29 00 01 	addi    r9,r9,1
	c0005758:	91 22 00 00 	stw     r9,0(r2)
	c000575c:	3d 20 c0 af 	lis     r9,-16209
	c0005760:	83 e9 f4 cc 	lwz     r31,-2868(r9)
	c0005764:	2c 1f 00 00 	cmpwi   r31,0
	c0005768:	41 82 00 24 	beq     c000578c <__do_irq+0x8c>
	c000576c:	81 3f 00 00 	lwz     r9,0(r31)
	c0005770:	80 7f 00 04 	lwz     r3,4(r31)
	c0005774:	7d 29 03 a6 	mtctr   r9
	c0005778:	7f c4 f3 78 	mr      r4,r30
	c000577c:	4e 80 04 21 	bctrl
	c0005780:	85 3f 00 0c 	lwzu    r9,12(r31)
	c0005784:	2c 09 00 00 	cmpwi   r9,0
	c0005788:	40 82 ff e4 	bne     c000576c <__do_irq+0x6c>

Behind the fact of now using a direct 'bl' instead of a
'load/mtctr/bctr' sequence, we can also see that we get one less
register on the stack.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
8f7fadb4ba powerpc/machdep: Remove stale functions from ppc_md structure
ppc_md.iommu_save() is not set anymore by any platform after
commit c40785ad305b ("powerpc/dart: Use a cachable DART").
So iommu_save() has become a nop and can be removed.

ppc_md.show_percpuinfo() is not set anymore by any platform after
commit 4350147a816b ("[PATCH] ppc64: SMU based macs cpufreq support").

Last users of ppc_md.rtc_read_val() and ppc_md.rtc_write_val() were
removed by commit 0f03a43b8f0f ("[POWERPC] Remove todc code from
ARCH=powerpc")

Last user of kgdb_map_scc() was removed by commit 17ce452f7ea3 ("kgdb,
powerpc: arch specific powerpc kgdb support").

ppc.machine_kexec_prepare() has not been used since
commit 8ee3e0d69623 ("powerpc: Remove the main legacy iSerie platform
code"). This allows the removal of machine_kexec_prepare() and the
rename of default_machine_kexec_prepare() into machine_kexec_prepare()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
[mpe: Drop prototype for default_machine_kexec_prepare() as noted by dja]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/24d4ca0ada683c9436a5f812a7aeb0a1362afa2b.1630398606.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
e606a2f46c powerpc/time: Remove generic_suspend_{dis/en}able_irqs()
Commit d75d68cfef49 ("powerpc: Clean up obsolete code relating to
decrementer and timebase") made generic_suspend_enable_irqs() and
generic_suspend_disable_irqs() static.

Fold them into their only caller.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c3f9ec9950394ef939014f7934268e6ee30ca04f.1630398566.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:05 +11:00
Christophe Leroy
566af8cda3 powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal
targets") added generic support for AUDIT but that didn't include
support for bi-arch like powerpc.

Commit 4b58841149dc ("audit: Add generic compat syscall support")
added generic support for bi-arch.

Convert powerpc to that bi-arch generic audit support.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a4b3951d1191d4183d92a07a6097566bde60d00a.1629812058.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:04 +11:00
Christophe Leroy
a85c728cb5 powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
Instructions lmw/stmw are interesting for functions that are rarely
used and not in the cache, because only one instruction is to be
copied into the instruction cache instead of 19. However those
instruction are less performant than 19x raw lwz/stw as they require
synchronisation plus one additional cycle.

SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
mostly in interrupts entries/exits and in task switch so they are
likely already in the cache.

Using standard lwz improves null_syscall selftest by:
- 10 cycles on mpc832x.
- 2 cycles on mpc8xx.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/316c543b8906712c108985c8463eec09c8db577b.1629732542.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:04 +11:00
Anatolij Gustschin
aed2886a5e powerpc/5200: dts: fix memory node unit name
Fixes build warnings:
Warning (unit_address_vs_reg): /memory: node has a reg or ranges property, but no unit name

Signed-off-by: Anatolij Gustschin <agust@denx.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211013220532.24759-4-agust@denx.de
2021-10-22 15:22:04 +11:00
Anatolij Gustschin
7855b6c66d powerpc/5200: dts: fix pci ranges warnings
Fix ranges property warnings:
pci@f0000d00:ranges: 'oneOf' conditional failed, one must be fixed:

Signed-off-by: Anatolij Gustschin <agust@denx.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211013220532.24759-3-agust@denx.de
2021-10-22 15:22:04 +11:00
Anatolij Gustschin
e9efabc6e4 powerpc/5200: dts: add missing pci ranges
Add ranges property to fix build warnings:
Warning (pci_bridge): /pci@f0000d00: missing ranges for PCI bridge (or not a bridge)

Signed-off-by: Anatolij Gustschin <agust@denx.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211013220532.24759-2-agust@denx.de
2021-10-22 15:22:04 +11:00
Gustavo A. R. Silva
61cb9ac66b powerpc/vas: Fix potential NULL pointer dereference
(!ptr && !ptr->foo) strikes again. :)

The expression (!ptr && !ptr->foo) is bogus and in case ptr is NULL,
it leads to a NULL pointer dereference: ptr->foo.

Fix this by converting && to ||

This issue was detected with the help of Coccinelle, and audited and
fixed manually.

Fixes: 1a0d0d5ed5e3 ("powerpc/vas: Add platform specific user window operations")
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015050345.GA1161918@embeddedor
2021-10-22 15:22:04 +11:00
Christophe Leroy
49e3d8ea62 powerpc/fsl_booke: Enable STRICT_KERNEL_RWX
Enable STRICT_KERNEL_RWX on fsl_booke.

For that, we need additional TLBCAMs dedicated to linear mapping,
based on the alignment of _sinittext.

By default, up to 768 Mbytes of memory are mapped.
It uses 3 TLBCAMs of size 256 Mbytes.

With a data alignment of 16, we need up to 9 TLBCAMs:
  16/16/16/16/64/64/64/256/256

With a data alignment of 4, we need up to 12 TLBCAMs:
  4/4/4/4/16/16/16/64/64/64/256/256

With a data alignment of 1, we need up to 15 TLBCAMs:
  1/1/1/1/4/4/4/16/16/16/64/64/64/256/256

By default, set a 16 Mbytes alignment as a compromise between memory
usage and number of TLBCAMs. This can be adjusted manually when needed.

For the time being, it doens't work when the base is randomised.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/29f9e5d2bbbc83ae9ca879265426a6278bf4d5bb.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
d5970045cf powerpc/fsl_booke: Update of TLBCAMs after init
After init, set readonly memory as ROX and set readwrite
memory as RWX, if STRICT_KERNEL_RWX is enabled.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/66bef0b9c273e1121706883f3cf5ad0a053d863f.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
0b2859a743 powerpc/fsl_booke: Allocate separate TLBCAMs for readonly memory
Reorganise TLBCAM allocation so that when STRICT_KERNEL_RWX is
enabled, TLBCAMs are allocated such that readonly memory uses
different TLBCAMs.

This results in an allocation looking like:

Memory CAM mapping: 4/4/4/1/1/1/1/16/16/16/64/64/64/256/256 Mb, residual: 256Mb

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8ca169bc288261a0e0558712f979023c3a960ebb.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
52bda69ae8 powerpc/fsl_booke: Tell map_mem_in_cams() if init is done
In order to be able to call map_mem_in_cams() once more
after init for STRICT_KERNEL_RWX, add an argument.

For now, map_mem_in_cams() is always called only during init.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3b69a7e0b393b16984ade882a5eae5d727717459.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
a97dd9e2f7 powerpc/fsl_booke: Enable reloading of TLBCAM without switching to AS1
Avoid switching to AS1 when reloading TLBCAM after init for
STRICT_KERNEL_RWX.

When we setup AS1 we expect the entire accessible memory to be mapped
through one entry, this is not the case anymore at the end of init.

We are not changing the size of TLBCAMs, only flags, so no need to
switch to AS1.

So change loadcam_multi() to not switch to AS1 when the given
temporary tlb entry in 0.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a9d517fbfbc940f56103c46b323f6eb8f4485571.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
01116e6e98 powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs
Don't force MAS3_SX and MAS3_UX at all time. Take into account the
exec flag.

While at it, fix a couple of closeby style problems (indent with space
and unnecessary parenthesis), it keeps more readability.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5467044e59f27f9fcf709b9661779e3ce5f784f6.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:03 +11:00
Christophe Leroy
3a75fd709c powerpc/fsl_booke: Rename fsl_booke.c to fsl_book3e.c
We have a myriad of CONFIG symbols around different variants
of BOOKEs, which would be worth tidying up one day.

But at least, make file names and CONFIG option match:

We have CONFIG_FSL_BOOKE and CONFIG_PPC_FSL_BOOK3E.

fsl_booke.c is selected by and only by CONFIG_PPC_FSL_BOOK3E.
So rename it fsl_book3e to reduce confusion.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5dc871db1f67739319bec11f049ca450da1c13a2.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:02 +11:00
Christophe Leroy
68b44f94d6 powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCE
fsl_booke and 44x are not able to map kernel linear memory with
pages, so they can't support DEBUG_PAGEALLOC and KFENCE, and
STRICT_KERNEL_RWX is also a problem for now.

Enable those only on book3s (both 32 and 64 except KFENCE), 8xx and 40x.

Fixes: 88df6e90fa97 ("[POWERPC] DEBUG_PAGEALLOC for 32-bit")
Fixes: 95902e6c8864 ("powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32")
Fixes: 90cbac0e995d ("powerpc: Enable KFENCE for PPC32")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d1ad9fdd9b27da3fdfa16510bb542ed51fa6e134.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22 15:22:02 +11:00
Wan Jiabing
7453f501d4 powerpc/kexec_file: Add of_node_put() before goto
Fix following coccicheck warning:
./arch/powerpc/kexec/file_load_64.c:698:1-22: WARNING: Function
for_each_node_by_type should have of_node_put() before goto

Early exits from for_each_node_by_type should decrement the
node reference counter.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211018015418.10182-1-wanjiabing@vivo.com
2021-10-22 15:22:02 +11:00
Wan Jiabing
915b368f69 powerpc/pseries/iommu: Add of_node_put() before break
Fix following coccicheck warning:

./arch/powerpc/platforms/pseries/iommu.c:924:1-28: WARNING: Function
for_each_node_with_property should have of_node_put() before break

Early exits from for_each_node_with_property should decrement the
node reference counter.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211014075624.16344-1-wanjiabing@vivo.com
2021-10-22 15:22:02 +11:00
Joel Stanley
4f703e7faa powerpc/s64: Clarify that radix lacks DEBUG_PAGEALLOC
The page_alloc.c code will call into __kernel_map_pages() when
DEBUG_PAGEALLOC is configured and enabled.

As the implementation assumes hash, this should crash spectacularly if
not for a bit of luck in __kernel_map_pages(). In this function
linear_map_hash_count is always zero, the for loop exits without doing
any damage.

There are no other platforms that determine if they support
debug_pagealloc at runtime. Instead of adding code to mm/page_alloc.c to
do that, this change turns the map/unmap into a noop when in radix
mode and prints a warning once.

Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Reformat if per Christophe's suggestion]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211013213438.675095-1-joel@jms.id.au
2021-10-22 15:22:02 +11:00
Linus Torvalds
0a3221b658 powerpc fixes for 5.15 #5
Fix a bug exposed by a previous fix, where running guests with certain SMT topologies
 could crash the host on Power8.
 
 Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is enabled.
 
 Thanks to: Nathan Lynch, Srikar Dronamraju, Valentin Schneider.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmFxTaUTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFj/D/9J/rz1zYKxvPd+o5WsJBWrgsVBxgGr
 cIhtKARlh0Yoa9fhIYF6ir4HUT2TOhokwJJco7fcND73rfrHfbWshWEzwaR1Lz/2
 WXr0Np509GQw1KCC9nv5DoE5qqLLL+rS+ul75Khkoi99wa/TDD8kQ5o3zUWtDoad
 D6paUIOf+DjnzzD7hIuTljfuOL/P3S4OyarQEopD69n8IgCTCrSsKdzOzAvlKcuL
 c3OoGAY5gasHowfxLfOigkCYloT4gJES7Kxl92QnPByOJUfqhjfKxiN4l0ydR6Bl
 4dUFUp/PkvvmTZiIA3tFzu0fHOgjUCvICyEAewLX7Bh9fXAVYMRc1a2l4LrCWD8T
 hQqQFskxa/viG7Uq2N58Rts9GtF5nG5d6+4l5Ndxus2+tUledjiuzOv3nsSq64TF
 y+Gws66a+A8a7DjcHo281BWzWn3XCIuug4A0doyvJFNdiy/+zj3ojfUhppZam8ca
 DYo7ot71l0XGvFM0Yhr/NQ9MEdpgT19VC0xU2pafnkp6yGUqjH1b9xelvg+uMZEc
 vXb7h+XV0eY4Rdr+3wTnXIwIWnEslJ23YQxFvhZeWeF321wsR5kbBGvNf/aY6xQB
 rLkNX63RdpyYjHGcn3qwcYTkk8wAjCR719tFal8hRMLw+CCjzqcinb7P5FLi4hQv
 cUV/vDffiwyReg==
 =572c
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a bug exposed by a previous fix, where running guests with
   certain SMT topologies could crash the host on Power8.

 - Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is
   enabled.

Thanks to Nathan Lynch, Srikar Dronamraju, and Valentin Schneider.

* tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/smp: do not decrement idle task preempt count in CPU offline
  powerpc/idle: Don't corrupt back chain when going idle
2021-10-21 15:30:09 -10:00
Len Baker
5dfbbb668a KVM: PPC: Replace zero-length array with flexible array member
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use "flexible array members" [1] for these cases. The
older style of one-element or zero-length arrays should no longer be
used[2].

Also, make use of the struct_size() helper in kzalloc().

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:30:42 -05:00
Rob Herring
41408b22ec powerpc: Use of_get_cpu_hwid()
Replace open coded parsing of CPU nodes' 'reg' property with
of_get_cpu_hwid().

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211006164332.1981454-8-robh@kernel.org
2021-10-20 13:37:24 -05:00
Nathan Lynch
787252a10d powerpc/smp: do not decrement idle task preempt count in CPU offline
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
2021-10-20 21:38:01 +11:00
Michael Ellerman
496c5fe25c powerpc/idle: Don't corrupt back chain when going idle
In isa206_idle_insn_mayloss() we store various registers into the stack
red zone, which is allowed.

However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
to 0(r1), which corrupts the stack back chain.

We used to do the same in isa206_idle_insn_mayloss() itself, but we
fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving
GPRs"), however we missed that the macro also corrupts the back chain.

Corrupting the back chain is bad for debuggability but doesn't
necessarily cause a bug.

However we recently changed the stack handling in some KVM code, and it
now relies on the stack back chain being valid when it returns. The
corruption causes that code to return with r1 pointing somewhere in
kernel data, at some point LR is restored from the stack and we branch
to NULL or somewhere else invalid.

Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
enabled (which it is by default).

The fixes tag below points to the commit that changed the KVM stack
handling, exposing this bug. The actual corruption of the back chain has
always existed since 948cf67c4726 ("powerpc: Add NAP mode support on
Power7 in HV mode").

Fixes: 9b4416c5095c ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
2021-10-20 21:37:58 +11:00
Kajol Jain
26da4abfb3 powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
Fix the data source encodings to represent L2.1/L3.1(another core's
L2/L3 on the same node) accesses properly for power10 and older
plaforms.

Add new macros(LEVEL/REM) which can be used to add mem_lvl_num and remote
field data inside perf_mem_data_src structure.

Result in power9 system with patch changes:

localhost:~/linux/tools/perf # ./perf mem report | grep Remote
     0.01%             1  252           Remote core, same node L3 or L3 hit  [.] 0x0000000000002dd0                producer_consumer   [.] 0x00007fff7f25eb90
anon               HitM          N/A                     No       N/A        0              0
     0.01%             1  220           Remote core, same node L3 or L3 hit  [.] 0x0000000000002dd0                producer_consumer   [.] 0x00007fff77776d90
anon               HitM          N/A                     No       N/A        0              0
     0.01%             1  220           Remote core, same node L3 or L3 hit  [.] 0x0000000000002dd0                producer_consumer   [.] 0x00007fff817d9410
anon               HitM          N/A                     No       N/A        0              0

Fixes: 79e96f8f930d ("powerpc/perf: Export memory hierarchy info to user space")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211006140654.298352-5-kjain@linux.ibm.com
2021-10-19 17:27:01 +02:00
Andreas Gruenbacher
bb523b406c gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in.  This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.

Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:33:03 +02:00
Andreas Gruenbacher
0c8eb2884a powerpc/kvm: Fix kvm_use_magic_page
When switching from __get_user to fault_in_pages_readable, commit
9f9eae5ce717 broke kvm_use_magic_page: like __get_user,
fault_in_pages_readable returns 0 on success.

Fixes: 9f9eae5ce717 ("powerpc/kvm: Prefer fault_in_pages_readable function")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:31:45 +02:00
Uwe Kleine-König
4141127c44 powerpc/eeh: Use to_pci_driver() instead of pci_dev->driver
Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.

Replace pdev->driver with to_pci_driver().  This is a step toward removing
pci_dev->driver.

[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-10-18 09:20:15 -05:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Linus Torvalds
be9eb2f00f powerpc fixes for 5.15 #4
Fix a bug where guests on P9 with interrupts passed through could get stuck in
 synchronize_irq().
 
 Fix a bug in KVM on P8 where secondary threads entering a guest would write outside their
 allocated stack.
 
 Fix a bug in KVM on P8 where secondary threads could confuse the host offline code and
 cause the guest or host to crash.
 
 Thanks to: Cédric Le Goater
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmFsFZkTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgBQND/0VSdTHegF+7kBqWnyP+Rc8aXwr/kPL
 xijmmeWS+BY0nkW7BNabRwVtOHZ4lKyDxPMwFiQ+tbpsmm4LX9sA9igV2yXdK73Y
 pRFkJBxljxXuO3eYa4peQwsvHnI7cR//fUr94b0OT6r0N39T5tKq5MZvWawk05y4
 RivpMOnEtDEd12ySJk5/fshwNKV5L7InI2FRKi3EU8sPNV3XA/z5JQ1CKuvXUUiK
 1HIwwxYqECcMM0dkIo7Qa7F93Wr0V8ZjsXLFV96jpBNL2aUfXVMaIBmITwa1Pk+i
 1581QmI4AJx3b3YAMqoujrgFJQKxaw2ZjiFfDcHGcTD4qCNVH2MNeyHM86Z7lFKL
 osYI7Ya3zzlwd4qhPsLV21mBdBp1xjpiBj1FiRrq4o+Me+O6eQnmG2abL8xUFI/3
 F7ab9qV8xjqfw/XlAXKtb7cco4SSQfi6XuLYSdTduCU5A1unLgpBw9F6H14YrQ5A
 8/uriGLIzhgklwDYgBJUGOjJzMCv0ey1+MJF4R4tEeIJ7+83WuAHCf2eAnxSVXLT
 Q2JaFt1LP8fdW1IPYYICjVgahREsWvTW4wK2lk5sO+BSoop6EbOVfz5TK52ePxLC
 II4ETYG3tIipLCkf2tqF/rU4De7d2SRg70lCNdEBVU9GpmiXec4uwPyX3uAwKdlf
 9qphXh0d15LAPg==
 =fPMr
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a bug where guests on P9 with interrupts passed through could get
   stuck in synchronize_irq().

 - Fix a bug in KVM on P8 where secondary threads entering a guest would
   write outside their allocated stack.

 - Fix a bug in KVM on P8 where secondary threads could confuse the host
   offline code and cause the guest or host to crash.

Thanks to Cédric Le Goater.

* tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest
  KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()
  powerpc/xive: Discard disabled interrupts in get_irqchip_state()
2021-10-17 18:01:32 -10:00
Michael Ellerman
cdeb5d7d89 KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest
We call idle_kvm_start_guest() from power7_offline() if the thread has
been requested to enter KVM. We pass it the SRR1 value that was returned
from power7_idle_insn() which tells us what sort of wakeup we're
processing.

Depending on the SRR1 value we pass in, the KVM code might enter the
guest, or it might return to us to do some host action if the wakeup
requires it.

If idle_kvm_start_guest() is able to handle the wakeup, and enter the
guest it is supposed to indicate that by returning a zero SRR1 value to
us.

That was the behaviour prior to commit 10d91611f426 ("powerpc/64s:
Reimplement book3s idle code in C"), however in that commit the
handling of SRR1 was reworked, and the zeroing behaviour was lost.

Returning from idle_kvm_start_guest() without zeroing the SRR1 value can
confuse the host offline code, causing the guest to crash and other
weirdness.

Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015133929.832061-2-mpe@ellerman.id.au
2021-10-16 00:40:03 +11:00
Michael Ellerman
9b4416c509 KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()
In commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in
C") kvm_start_guest() became idle_kvm_start_guest(). The old code
allocated a stack frame on the emergency stack, but didn't use the
frame to store anything, and also didn't store anything in its caller's
frame.

idle_kvm_start_guest() on the other hand is written more like a normal C
function, it creates a frame on entry, and also stores CR/LR into its
callers frame (per the ABI). The problem is that there is no caller
frame on the emergency stack.

The emergency stack for a given CPU is allocated with:

  paca_ptrs[i]->emergency_sp = alloc_stack(limit, i) + THREAD_SIZE;

So emergency_sp actually points to the first address above the emergency
stack allocation for a given CPU, we must not store above it without
first decrementing it to create a frame. This is different to the
regular kernel stack, paca->kstack, which is initialised to point at an
initial frame that is ready to use.

idle_kvm_start_guest() stores the backchain, CR and LR all of which
write outside the allocation for the emergency stack. It then creates a
stack frame and saves the non-volatile registers. Unfortunately the
frame it creates is not large enough to fit the non-volatiles, and so
the saving of the non-volatile registers also writes outside the
emergency stack allocation.

The end result is that we corrupt whatever is at 0-24 bytes, and 112-248
bytes above the emergency stack allocation.

In practice this has gone unnoticed because the memory immediately above
the emergency stack happens to be used for other stack allocations,
either another CPUs mc_emergency_sp or an IRQ stack. See the order of
calls to irqstack_early_init() and emergency_stack_init().

The low addresses of another stack are the top of that stack, and so are
only used if that stack is under extreme pressue, which essentially
never happens in practice - and if it did there's a high likelyhood we'd
crash due to that stack overflowing.

Still, we shouldn't be corrupting someone else's stack, and it is purely
luck that we aren't corrupting something else.

To fix it we save CR/LR into the caller's frame using the existing r1 on
entry, we then create a SWITCH_FRAME_SIZE frame (which has space for
pt_regs) on the emergency stack with the backchain pointing to the
existing stack, and then finally we switch to the new frame on the
emergency stack.

Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015133929.832061-1-mpe@ellerman.id.au
2021-10-16 00:39:54 +11:00
Kees Cook
42a20f86dc sched: Add wrapper for get_wchan() to keep task blocked
Having a stable wchan means the process must be blocked and for it to
stay that way while performing stack unwinding.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
2021-10-15 11:25:14 +02:00