IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Create a function for saving SPRs before entering deep idle states.
This function can be reused for POWER9 deep idle states.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pnv_powersave_common does common steps needed before entering idle
state and eventually changes MSR to MSR_IDLE and does rfid to
pnv_enter_arch207_idle_mode.
Move the updation of HSTATE_HWTHREAD_STATE to pnv_powersave_common
from pnv_enter_arch207_idle_mode and make it more generic by passing the
rfid address as a function parameter.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Functions like power7_wakeup_loss, power7_wakeup_noloss,
power7_wakeup_tb_loss are used by POWER7 and POWER8 hardware. They can
also be used by POWER9. Hence rename these functions hardware agnostic
names.
Suggested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
idle_power7.S handles idle entry/exit for POWER7, POWER8 and in next
patch for POWER9. Rename the file to a non-hardware specific
name.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the current code, when the thread wakes up in reset vector, some
of the state restore code and check for whether a thread needs to
branch to kvm is duplicated. Reorder the code such that this
duplication is avoided.
At a higher level this is what the change looks like-
Before this patch -
power7_wakeup_tb_loss:
restore hypervisor state
if (thread needed by kvm)
goto kvm_start_guest
restore nvgprs, cr, pc
rfid to process context
power7_wakeup_loss:
restore nvgprs, cr, pc
rfid to process context
reset vector:
if (waking from deep idle states)
goto power7_wakeup_tb_loss
else
if (thread needed by kvm)
goto kvm_start_guest
goto power7_wakeup_loss
After this patch -
power7_wakeup_tb_loss:
restore hypervisor state
return
power7_restore_hyp_resource():
if (waking from deep idle states)
goto power7_wakeup_tb_loss
return
power7_wakeup_loss:
restore nvgprs, cr, pc
rfid to process context
reset vector:
power7_restore_hyp_resource()
if (thread needed by kvm)
goto kvm_start_guest
goto power7_wakeup_loss
Reviewed-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the start of __tm_recheckpoint() we save the kernel stack pointer
(r1) in SPRG SCRATCH0 (SPRG2) so that we can restore it after the
trecheckpoint.
Unfortunately, the same SPRG is used in the SLB miss handler. If an
SLB miss is taken between the save and restore of r1 to the SPRG, the
SPRG is changed and hence r1 is also corrupted. We can end up with
the following crash when we start using r1 again after the restore
from the SPRG:
Oops: Bad kernel stack pointer, sig: 6 [#1]
SMP NR_CPUS=2048 NUMA pSeries
CPU: 658 PID: 143777 Comm: htm_demo Tainted: G EL X 4.4.13-0-default #1
task: c0000b56993a7810 ti: c00000000cfec000 task.ti: c0000b56993bc000
NIP: c00000000004f188 LR: 00000000100040b8 CTR: 0000000010002570
REGS: c00000000cfefd40 TRAP: 0300 Tainted: G EL X (4.4.13-0-default)
MSR: 8000000300001033 <SF,ME,IR,DR,RI,LE> CR: 02000424 XER: 20000000
CFAR: c000000000008468 DAR: 00003ffd84e66880 DSISR: 40000000 SOFTE: 0
PACATMSCRATCH: 00003ffbc865e680
GPR00: fffffffcfabc4268 00003ffd84e667a0 00000000100d8c38 000000030544bb80
GPR04: 0000000000000002 00000000100cf200 0000000000000449 00000000100cf100
GPR08: 000000000000c350 0000000000002569 0000000000002569 00000000100d6c30
GPR12: 00000000100d6c28 c00000000e6a6b00 00003ffd84660000 0000000000000000
GPR16: 0000000000000003 0000000000000449 0000000010002570 0000010009684f20
GPR20: 0000000000800000 00003ffd84e5f110 00003ffd84e5f7a0 00000000100d0f40
GPR24: 0000000000000000 0000000000000000 0000000000000000 00003ffff0673f50
GPR28: 00003ffd84e5e960 00000000003d0f00 00003ffd84e667a0 00003ffd84e5e680
NIP [c00000000004f188] restore_gprs+0x110/0x17c
LR [00000000100040b8] 0x100040b8
Call Trace:
Instruction dump:
f8a1fff0 e8e700a8 38a00000 7ca10164 e8a1fff8 e821fff0 7c0007dd 7c421378
7db142a6 7c3242a6 38800002 7c810164 <e9c100e0> e9e100e8 ea0100f0 ea2100f8
We hit this on large memory machines (> 2TB) but it can also be hit on
smaller machines when 1TB segments are disabled.
To hit this, you also need to be virtualised to ensure SLBs are
periodically removed by the hypervisor.
This patches moves the saving of r1 to the SPRG to the region where we
are guaranteed not to take any further SLB misses.
Fixes: 98ae22e15b43 ("powerpc: Add helper functions for transactional memory context switching")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- tm: Always reclaim in start_thread() for exec() class syscalls from Cyril Bur
- tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael Neuling
- eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan
- Initialise pci_io_base as early as possible from Darren Stevens
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXeEmsAAoJEFHr6jzI4aWAwMMQAKs/u9rwB3gpOkNJSHajN1Dd
kdqDufzLxLDwbWnMfqM1+bcO2EOjPhKbtpbzhG6oeiET8undRRoLsjHS5rZeYK5h
cviRPEJ/Yz8ZWaIgFGI8+02gXwU0MJhuTY8NPexXmsh4FRdKYwEuCIJShl30lg22
P7UrJ2SCNM+H/uZyS07B7thiwBeAKSp6VkLTpuW/QDz2j1ra/F22dTh7c0Agdahd
INAMAnh9nYeuMVYn4XjOOlQ07JnBTuf1/W5Wxlw4i/86rVq+Hy8zh5r1X52oysR5
lZl63B9q3agKG9cc9lSN2ibTDVerlFMwB2QysX2a6Uy7+y2SB3hS7VS1RTXCh3hg
/omApGGVW3Hh+E2CuKfFLQySU55NRpLAoTGravGr/KsH4wZP/n/fkrctldCrqm7P
sTPT52+t+iJQk4fiskRY3yQ7DTTnt3rTC8MJRGqvLuCheolLll4NQaWOF75AJP+7
WFWtC4QHOTPERMkhqLnZDG2vNuDg1H8chuZ2+PxtIs6G1vuOEun+MTZAYh4u6XWE
bAIT9rV3xBdE17bzYOQz7lU1y7yNVtP7xkm0HIOAHlU4gUrjQp5u8F3TnPW3/M0m
8GeaZdrPjhsaNg31YZODAeM8Ddf+N9d2a2VPIr/fzytURhMe0ss3Z/MdMoYRATab
Lh1o+G3gDo9MVaphoJ3w
=oEAY
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-5' into next
Pull in the fixes we sent during 4.7, we have code we want to merge into
next that depends on some of them.
powernv marks it's halt and restart calls as __noreturn. However,
ppc_md does not have this annotation. Add the annotation to ppc_md,
and then to every halt/restart function that is missing it.
Additionally, I have verified that all of these functions do not
return. Occasionally I have added a spin loop to be sure.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The array crash_shutdown_handles[] has size CRASH_HANDLER_MAX, thus when
we loop over the elements of the list we check crash_shutdown_handles[i]
&& i < CRASH_HANDLER_MAX. However this means that when we increment i to
CRASH_HANDLER_MAX we will perform an out of bound array access checking
the first condition before exiting on the second condition.
To avoid the out of bounds access, simply reorder the loop conditions.
Fixes: 1d1451655bad ("powerpc: Add array bounds checking to crash_shutdown_handlers")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The subsequent test for RTAS along with the LPAR test are sufficient
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The test is unnecessary, the FW_FEATURE_LPAR is sufficient as there
exist no other LPAR type that has RTAS.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function is called by both 32-bit and 64-bit early setup right
after early_init_devtree(). All it does is run yet another early
DT parser which is precisely what early_init_devtree() is about,
so move it in there.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anything in early_setup() needs to be justified to be there, in
this case, we need the trampolines before we can take exceptions
and thus before we turn on the MMU.
Also remove a pretty meaningless and misplaced debug message
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fix comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
early_init() is called in-place before kernel relocation and using
whatever MMU setup exists at the point the kernel is entered.
machine_init() is called after relocation and after some initial
mapping of PAGE_OFFSET has been established (typically using BATs
on 6xx/7xx/7xxx processors or some form of bolted TLB on others).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The asm-offsets mechanism generates signed numbers, even if the
input value is explicitly unsigned. This causes a problem with
older binutils (e.g. 2.23), which sign-extend a negative number
when @h is applied. Thus, this instruction:
cmpli cr0, r11, VIRT_IMMR_BASE@h
resulted in this:
Error: operand out of range (0xfffffff0 is not between 0x00000000 and
0x0000ffff)
By casting to a larger type, we can force the output to be expressed
as a positive number.
Signed-off-by: Scott Wood <oss@buserror.net>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
CONFIG_PIN_TLB maps IMMR area and the first 24 Mbytes of memory.
In some circunstances it might be more interesting to not map
IMMR but map 32 Mbytes of memory instead.
Therefore we add config option CONFIG_PIN_TLB_IMMR to select if
IMMR shall be pinned or not, hence whether we pin 24 or 32 Mbytes of RAM
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
On recent kernels, with some debug options like for instance
CONFIG_LOCKDEP, the BSS requires more than 8M memory, allthough
the kernel code fits in the first 8M.
Today, it is necessary to activate CONFIG_PIN_TLB to get more than 8M
at startup, allthough pinning TLB is not necessary for that.
We could have inconditionaly mapped 16 or 24M bytes at startup
but some old hardware only have 8M and mapping non-existing RAM
would be an issue due to speculative accesses.
With the preceding patch however, the TLB entries are populated on
demand. By setting up the TLB miss handler to handle up to 24M until
the handler is patched for the entire memory space, it is possible
to allow access up to more memory without mapping non-existing RAM.
It is therefore not needed anymore to map memory data at all
at startup. It will be handled by the TLB miss handler.
One might still want to PIN the IMMR and the first 24M of RAM.
It is now possible to do it in the C memory initialisation
functions. In addition, we now know how much memory we have
when we do it, so we are able to adapt the pining to the
real amount of memory available. So boards with less than 24M
can now also benefit from PIN_TLB.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Instead of using the first level page table to define mappings for
the linear memory space, we can use direct mapping from the TLB
handling routines. This has several advantages:
* No need to read the tables at each TLB miss
* No issue in 16k pages mode where the 1st level table maps 64 Mbytes
The size of the available linear space is known at system startup.
In order to avoid data access at each TLB miss to know the memory
size, the TLB routine is patched at startup with the proper size
This patch provides a 10%-15% improvment of TLB miss handling for
kernel addresses
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Bootloader may have pinned some TLB entries so the kernel must
unpin them before flushing TLBs with tlbia otherwise pinned TLB
entries won't get flushed
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Once the linear memory space has been mapped with 8Mb pages, as
seen in the related commit, we get 11 millions DTLB missed during
the reference 600s period. 77% of the misses are on user addresses
and 23% are on kernel addresses (1 fourth for linear address space
and 3 fourth for virtual address space)
Traditionaly, each driver manages one computer board which has its
own components with its own memory maps.
But on embedded chips like the MPC8xx, the SOC has all registers
located in the same IO area.
When looking at ioremaps done during startup, we see that
many drivers are re-mapping small parts of the IMMR for their own use
and all those small pieces gets their own 4k page, amplifying the
number of TLB misses: in our system we get 0xff000000 mapped 31 times
and 0xff003000 mapped 9 times.
Even if each part of IMMR was mapped only once with 4k pages, it would
still be several small mappings towards linear area.
This patch maps the IMMR with a single 512k page.
With this patch applied, the number of DTLB misses during the 10 min
period is reduced to 11.8 millions for a duration of 5.8s, which
represents 2% of the non-idle time hence yet another 10% reduction.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Memory: 124428K/131072K available (3748K kernel code, 188K rwdata,
648K rodata, 508K init, 290K bss, 6644K reserved)
Kernel virtual memory layout:
* 0xfffdf000..0xfffff000 : fixmap
* 0xfde00000..0xfe000000 : consistent mem
* 0xfddf6000..0xfde00000 : early ioremap
* 0xc9000000..0xfddf6000 : vmalloc & ioremap
SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Today, IMMR is mapped 1:1 at startup
Mapping IMMR 1:1 is just wrong because it may overlap with another
area. On most mpc8xx boards it is OK as IMMR is set to 0xff000000
but for instance on EP88xC board, IMMR is at 0xfa200000 which
overlaps with VM ioremap area
This patch fixes the virtual address for remapping IMMR with the fixmap
regardless of the value of IMMR.
The size of IMMR area is 256kbytes (CPM at offset 0, security engine
at offset 128k) so a 512k page is enough
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
This patch provides VIRT_CPU_ACCOUTING to PPC32 architecture.
PPC32 doesn't have the PACA structure, so we use the task_info
structure to store the accounting data.
In order to reuse on PPC32 the PPC64 functions, all u64 data has
been replaced by 'unsigned long' so that it is u32 on PPC32 and
u64 on PPC64
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
eeh_cache.c doesn't build cleanly with -DDEBUG when
CONFIG_PHYS_ADDR_T_64BIT is set, as a couple of pr_debug()s use "%lx" for
resource_size_t parameters.
Use "%pap" instead, as it's the correct format specifier for types deriving
from phys_addr_t.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A strange behaviour is observed when comparing PCI hotplug in QEMU, between
x86 and pseries. If you consider the following steps:
- start a VM
- add a PCI device via the QEMU monitor before the rtasd has started (for
example starting the VM in paused state, or hotplug during FW or boot
loader)
- resume the VM execution
The x86 kernel detects the PCI device, but the pseries one does not.
This happens because the rtasd kernel worker is currently started under
device_initcall, while PCI probing happens earlier under subsys_initcall.
As a consequence, if we have a pending RTAS event at boot time, a message
is printed and the event is dropped.
This patch moves all the initialization of rtasd to arch_initcall, which is
run before subsys_call: this way, logging_enabled is true when the RTAS
event pops up and it is not lost anymore.
The proc fs bits stay at device_initcall because they cannot be run before
fs_initcall.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The domain/PHB field of PCI addresses has its value obtained from a
global variable, incremented each time a new domain (represented by
struct pci_controller) is added on the system. The domain addition
process happens during boot or due to PHB hotplug add.
As recent kernels are using predictable naming for network interfaces,
the network stack is more tied to PCI naming. This can be a problem in
hotplug scenarios, because PCI addresses will change if devices are
removed and then re-added. This situation seems unusual, but it can
happen if a user wants to replace a NIC without rebooting the machine,
for example.
This patch changes the way PCI domain values are generated: now, we use
device-tree properties to assign fixed PHB numbers to PCI addresses
when available (meaning pSeries and PowerNV cases). We also use a bitmap
to allow dynamic PHB numbering when device-tree properties are not
used. This bitmap keeps track of used PHB numbers and if a PHB is
released (by hotplug operations for example), it allows the reuse of
this PHB number, avoiding PCI address to change in case of device remove
and re-add soon after. No functional changes were introduced.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
[mpe: Drop unnecessary machine_is(pseries) test]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Despite attempting to fix this in commit fb36e9073693 ("powerpc/pci: Fix
SRIOV not building without EEH enabled"), the build is still broken when
PCI_IOV=y and EEH=n (eg. g5_defconfig with PCI_IOV=y):
arch/powerpc/kernel/pci_dn.c: In function ‘remove_dev_pci_data’:
arch/powerpc/kernel/pci_dn.c:230:18: error: unused variable ‘edev’
Incorporate Ben's idea of using __maybe_unused to avoid so many #ifdefs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Power ISAv3 adds a large decrementer (LD) mode which increases the size
of the decrementer register. The size of the enlarged decrementer
register is between 32 and 64 bits with the exact size being dependent
on the implementation. When in LD mode, reads are sign extended to 64
bits and a decrementer exception is raised when the high bit is set (i.e
the value goes below zero). Writes however are truncated to the physical
register width so some care needs to be taken to ensure that the high
bit is not set when reloading the decrementer. This patch adds support
for using the LD inside the host kernel on processors that support it.
When LD mode is supported firmware will supply the ibm,dec-bits property
for CPU nodes to allow the kernel to determine the maximum decrementer
value. Enabling LD mode is a hypervisor privileged operation so the kernel
can only enable it manually when running in hypervisor mode. Guests that
support LD mode can request it using the "ibm,client-architecture-support"
firmware call (not implemented in this patch) or some other platform
specific method. If this property is not supplied then the traditional
decrementer width of 32 bit is assumed and LD mode will not be enabled.
This patch was based on initial work by Jack Miller.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If ppc_rtas() is called with args.nargs == 16 and args.nret == 0,
args.rets is set to point to &args.args[16], which is beyond the end of
the args.args array. This results in a minor read overrun of the array
when we check the first return code (which, per PAPR, is a required
output of all RTAS calls) to see if there's been a hardware error.
Change the nargs/nret check to ensure nargs is <= 15, allowing room for
the status code. Users shouldn't be calling with nret == 0, but there's
no real harm if they do, so we don't stop them.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Calling ISA 3.0 instructions copy, copy_first, paste and paste_last
generates an alignment fault when copying or pasting unaligned
data (128 byte). We catch this and send SIGBUS to the userspace
process that caused it.
We do not emulate these because paste may contain additional metadata
when pasting to a co-processor and paste_last is the synchronisation
point for preceding copy/paste sequences.
Thanks to Michael Neuling <mikey@neuling.org> for his help.
Signed-off-by: Chris Smart <chris@distroguy.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Factor out the power8 pmu init functions to share with
power9. Monitor Mode Control Register S(MMCRS) and
Monitor Mode Control Register H(MMCRH) registers are
dropped in Power9. These registers are added to new
function which are included for power8 init.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We spent so much time bike-shedding the printk() we missed that the next
line was missing a semi-colon. And it seems none of our defconfigs turn
on CONFIG_FA_DUMP.
Fixes: 4a03749f140c ("powerpc/fadump: Trivial fix of spelling mistake, clean up message")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for
radix") turned kernel memory and IO addresses from #defined constants to
variables initialised at runtime.
On PA6T (pasemi) systems the setup_arch() machine call initialises the
onboard PCI-e root-ports, and uses pci_io_base to do this, which is now
before its value has been set, resulting in a panic early in boot before
console IO is initialised.
Move the pci_io_base initialisation to the same place as vmalloc ranges
are set (hash__early_init_mmu()/radix__early_init_mmu()) - this is the
earliest possible place we can initialise it.
Fixes: d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Darren Stevens <darren@stevens-zone.net>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Add #ifdef CONFIG_PCI, massage change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we have 2 segments that are bolted for the kernel linear
mapping (ie 0xc000... addresses). This is 0 to 1TB and also the kernel
stacks. Anything accessed outside of these regions may need to be
faulted in. (In practice machines with TM always have 1T segments)
If a machine has < 2TB of memory we never fault on the kernel linear
mapping as these two segments cover all physical memory. If a machine
has > 2TB of memory, there may be structures outside of these two
segments that need to be faulted in. This faulting can occur when
running as a guest as the hypervisor may remove any SLB that's not
bolted.
When we treclaim and trecheckpoint we have a window where we need to
run with the userspace GPRs. This means that we no longer have a valid
stack pointer in r1. For this window we therefore clear MSR RI to
indicate that any exceptions taken at this point won't be able to be
handled. This means that we can't take segment misses in this RI=0
window.
In this RI=0 region, we currently access the thread_struct for the
process being context switched to or from. This thread_struct access
may cause a segment fault since it's not guaranteed to be covered by
the two bolted segment entries described above.
We've seen this with a crash when running as a guest with > 2TB of
memory on PowerVM:
Unrecoverable exception 4100 at c00000000004f138
Oops: Unrecoverable exception, sig: 6 [#1]
SMP NR_CPUS=2048 NUMA pSeries
CPU: 1280 PID: 7755 Comm: kworker/1280:1 Tainted: G X 4.4.13-46-default #1
task: c000189001df4210 ti: c000189001d5c000 task.ti: c000189001d5c000
NIP: c00000000004f138 LR: 0000000010003a24 CTR: 0000000010001b20
REGS: c000189001d5f730 TRAP: 4100 Tainted: G X (4.4.13-46-default)
MSR: 8000000100001031 <SF,ME,IR,DR,LE> CR: 24000048 XER: 00000000
CFAR: c00000000004ed18 SOFTE: 0
GPR00: ffffffffc58d7b60 c000189001d5f9b0 00000000100d7d00 000000003a738288
GPR04: 0000000000002781 0000000000000006 0000000000000000 c0000d1f4d889620
GPR08: 000000000000c350 00000000000008ab 00000000000008ab 00000000100d7af0
GPR12: 00000000100d7ae8 00003ffe787e67a0 0000000000000000 0000000000000211
GPR16: 0000000010001b20 0000000000000000 0000000000800000 00003ffe787df110
GPR20: 0000000000000001 00000000100d1e10 0000000000000000 00003ffe787df050
GPR24: 0000000000000003 0000000000010000 0000000000000000 00003fffe79e2e30
GPR28: 00003fffe79e2e68 00000000003d0f00 00003ffe787e67a0 00003ffe787de680
NIP [c00000000004f138] restore_gprs+0xd0/0x16c
LR [0000000010003a24] 0x10003a24
Call Trace:
[c000189001d5f9b0] [c000189001d5f9f0] 0xc000189001d5f9f0 (unreliable)
[c000189001d5fb90] [c00000000001583c] tm_recheckpoint+0x6c/0xa0
[c000189001d5fbd0] [c000000000015c40] __switch_to+0x2c0/0x350
[c000189001d5fc30] [c0000000007e647c] __schedule+0x32c/0x9c0
[c000189001d5fcb0] [c0000000007e6b58] schedule+0x48/0xc0
[c000189001d5fce0] [c0000000000deabc] worker_thread+0x22c/0x5b0
[c000189001d5fd80] [c0000000000e7000] kthread+0x110/0x130
[c000189001d5fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
Instruction dump:
7cb103a6 7cc0e3a6 7ca222a6 78a58402 38c00800 7cc62838 08860000 7cc000a6
38a00006 78c60022 7cc62838 0b060000 <e8c701a0> 7ccff120 e8270078 e8a70098
---[ end trace 602126d0a1dedd54 ]---
This fixes this by copying the required data from the thread_struct to
the stack before we clear MSR RI. Then once we clear RI, we only access
the stack, guaranteeing there's no segment miss.
We also tighten the region over which we set RI=0 on the treclaim()
path. This may have a slight performance impact since we're adding an
mtmsr instruction.
Fixes: 090b9284d725 ("powerpc/tm: Clear MSR RI in non-recoverable TM code")
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When calling eeh_rmv_device() in eeh_reset_device() for partial hotplug
case, @rmv_data instead of its address is the proper argument.
Otherwise, the stack frame is corrupted when writing to
@rmv_data (actually its address) in eeh_rmv_device(). It results in
kernel crash as observed.
This fixes the issue by passing @rmv_data, not its address to
eeh_rmv_device() in eeh_reset_device().
Fixes: 67086e32b564 ("powerpc/eeh: powerpc/eeh: Support error recovery for VF PE")
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix trivial spelling mistake "rgistration". Also use pr_err()
instead of printk() and unsplit the string to keep it all on one
line.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
[mpe: Keep rc on the same line, splitting it doesn't help]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Userspace can quite legitimately perform an exec() syscall with a
suspended transaction. exec() does not return to the old process, rather
it load a new one and starts that, the expectation therefore is that the
new process starts not in a transaction. Currently exec() is not treated
any differently to any other syscall which creates problems.
Firstly it could allow a new process to start with a suspended
transaction for a binary that no longer exists. This means that the
checkpointed state won't be valid and if the suspended transaction were
ever to be resumed and subsequently aborted (a possibility which is
exceedingly likely as exec()ing will likely doom the transaction) the
new process will jump to invalid state.
Secondly the incorrect attempt to keep the transactional state while
still zeroing state for the new process creates at least two TM Bad
Things. The first triggers on the rfid to return to userspace as
start_thread() has given the new process a 'clean' MSR but the suspend
will still be set in the hardware MSR. The second TM Bad Thing triggers
in __switch_to() as the processor is still transactionally suspended but
__switch_to() wants to zero the TM sprs for the new process.
This is an example of the outcome of calling exec() with a suspended
transaction. Note the first 700 is likely the first TM bad thing
decsribed earlier only the kernel can't report it as we've loaded
userspace registers. c000000000009980 is the rfid in
fast_exception_return()
Bad kernel stack pointer 3fffcfa1a370 at c000000000009980
Oops: Bad kernel stack pointer, sig: 6 [#1]
CPU: 0 PID: 2006 Comm: tm-execed Not tainted
NIP: c000000000009980 LR: 0000000000000000 CTR: 0000000000000000
REGS: c00000003ffefd40 TRAP: 0700 Not tainted
MSR: 8000000300201031 <SF,ME,IR,DR,LE,TM[SE]> CR: 00000000 XER: 00000000
CFAR: c0000000000098b4 SOFTE: 0
PACATMSCRATCH: b00000010000d033
GPR00: 0000000000000000 00003fffcfa1a370 0000000000000000 0000000000000000
GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 00003fff966611c0 0000000000000000 0000000000000000 0000000000000000
NIP [c000000000009980] fast_exception_return+0xb0/0xb8
LR [0000000000000000] (null)
Call Trace:
Instruction dump:
f84d0278 e9a100d8 7c7b03a6 e84101a0 7c4ff120 e8410170 7c5a03a6 e8010070
e8410080 e8610088 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed023b
Kernel BUG at c000000000043e80 [verbose debug info unavailable]
Unexpected TM Bad Thing exception at c000000000043e80 (msr 0x201033)
Oops: Unrecoverable exception, sig: 6 [#2]
CPU: 0 PID: 2006 Comm: tm-execed Tainted: G D
task: c0000000fbea6d80 ti: c00000003ffec000 task.ti: c0000000fb7ec000
NIP: c000000000043e80 LR: c000000000015a24 CTR: 0000000000000000
REGS: c00000003ffef7e0 TRAP: 0700 Tainted: G D
MSR: 8000000300201033 <SF,ME,IR,DR,RI,LE,TM[SE]> CR: 28002828 XER: 00000000
CFAR: c000000000015a20 SOFTE: 0
PACATMSCRATCH: b00000010000d033
GPR00: 0000000000000000 c00000003ffefa60 c000000000db5500 c0000000fbead000
GPR04: 8000000300001033 2222222222222222 2222222222222222 00000000ff160000
GPR08: 0000000000000000 800000010000d033 c0000000fb7e3ea0 c00000000fe00004
GPR12: 0000000000002200 c00000000fe00000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 c0000000fbea7410 00000000ff160000
GPR24: c0000000ffe1f600 c0000000fbea8700 c0000000fbea8700 c0000000fbead000
GPR28: c000000000e20198 c0000000fbea6d80 c0000000fbeab680 c0000000fbea6d80
NIP [c000000000043e80] tm_restore_sprs+0xc/0x1c
LR [c000000000015a24] __switch_to+0x1f4/0x420
Call Trace:
Instruction dump:
7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
This fixes CVE-2016-5828.
Fixes: bc2a9408fa65 ("powerpc: Hook in new transactional memory code")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If a PHB has no I/O space, there's no need to make it look like
something bad happened, a pr_debug() is plenty enough since this
is the case of all our modern POWER chips.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As part of the Radix MMU support we added some feature sections in the
SLB miss handler. These are intended to catch the case that we
incorrectly take an SLB miss when Radix is enabled, and instead of
crashing weirdly they bail out to a well defined exit path and trigger
an oops.
However the way they were written meant the bailout case was enabled by
default until we did CPU feature patching.
On powermacs the early debug prints in setup_system() can cause an SLB
miss, which happens before code patching, and so the SLB miss handler
would incorrectly bailout and crash during boot.
Fix it by inverting the sense of the feature section, so that the code
which is in place at boot is correct for the hash case. Once we
determine we are using Radix - which will never happen on a powermac -
only then do we patch in the bailout case which unconditionally jumps.
Fixes: caca285e5ab4 ("powerpc/mm/radix: Use STD_MMU_64 to properly isolate hash related code")
Reported-by: Denis Kirjanov <kda@linux-powerpc.org>
Tested-by: Denis Kirjanov <kda@linux-powerpc.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pdn (struct pci_dn) instances are allocated from memblock or
bootmem when creating PCI controller (hoses) in setup_arch(). PCI
hotplug, which will be supported by proceeding patches, releases
PCI device nodes and their corresponding pdn on unplugging event.
The memory chunks for pdn instances allocated from memblock or
bootmem are hard to reused after being released.
This delays creating pdn by pci_devs_phb_init() from setup_arch()
to core_initcall() so that they are allocated from slab. The memory
consumed by pdn can be released to system without problem during
PCI unplugging time. It indicates that pci_dn is unavailable in
setup_arch() and the the fixup on pdn (like AGP's) can't be carried
out that time. We have to do that in pcibios_root_bridge_prepare()
on maple/pasemi/powermac platforms where/when the pdn is available.
pcibios_root_bridge_prepare is called from subsys_initcall() which
is executed after core_initcall() so the code flow does not change.
At the mean while, the EEH device is created when pdn is populated,
meaning pdn and EEH device have same life cycle. In turn, we needn't
call eeh_dev_init() to create EEH device explicitly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On the PCI plugging event, PCI slot's subordinate devices are
scanned and their (IO and MMIO) resources are assigned. Platform
dependent resources (PE#, IO/MMIO/DMA windows) are allocated or
created on updating windows of the slot's upstream bridge.
This updates the windows of the hot plugged slot's upstream bridge
in pcibios_finish_adding_to_bus() so that the platform resources
(PE#, IO/MMIO/DMA segments) are allocated or created accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This overrides pcibios_setup_bridge() that is called to update PCI
bridge windows when PCI resource assignment is completed, to assign
PE and setup various (resource) mapping for the PE in subsequent
patches.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Export cpu_to_core_id(). This will be used by the lpfc driver.
This enables topology_core_id() from <linux/topology.h> (defined
to cpu_to_core_id() in arch/powerpc/include/asm/topology.h) to be
used by (non-builtin) modules.
That is arch-neutral, already used by eg, drivers/base/topology.c,
but it is builtin (obj-y in Makefile) thus didn't need the export.
Since the module uses topology_core_id() and this is defined to
cpu_to_core_id(), it needs the export, otherwise:
ERROR: "cpu_to_core_id" [drivers/scsi/lpfc/lpfc.ko] undefined!
Tested on next-20160601.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This enables new registers, LMRR and LMSER, that can trigger an EBB in
userspace code when a monitored load (via the new ldmx instruction)
loads memory from a monitored space. This facility is controlled by a
new FSCR bit, LM.
This patch disables the FSCR LM control bit on task init and enables
that bit when a load monitor facility unavailable exception is taken
for using it. On context switch, this bit is then used to determine
whether the two relevant registers are saved and restored. This is
done lazily for performance reasons.
Signed-off-by: Jack Miller <jack@codezen.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This fixes a few issues with FSCR init and switching.
In commit 152d523e6307 ("powerpc: Create context switch helpers
save_sprs() and restore_sprs()") we moved the setting of the FSCR
register from inside an CPU_FTR_ARCH_207S section to inside just a
CPU_FTR_ARCH_DSCR section. Hence we are setting FSCR on POWER6/7 where
the FSCR doesn't exist. This is harmless but we shouldn't do it.
Also, we can simplify the FSCR context switch. We don't need to go
through the calculation involving dscr_inherit. We can just restore
what we saved last time.
We also set an initial value in INIT_THREAD, so that pid 1 which is
cloned from that gets a sane value.
Based on patch by Jack Miller.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Current comment in the early_setup_secondary() for paca->soft_enabled
update is misleading. Comment should say to Mark interrupts "disabled"
instead of "enabled". Fix the typo.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fixes the following testsuite failure:
$ sudo ./perf test -v kallsyms
1: vmlinux symtab matches kallsyms :
--- start ---
test child forked, pid 12489
Using /proc/kcore for kernel object code
Looking at the vmlinux_path (8 entries long)
Using /boot/vmlinux for symbols
0xc00000000003d300: diff name v: .kretprobe_trampoline_holder k: kretprobe_trampoline
Maps only in vmlinux:
c00000000086ca38-c000000000879b6c 87ca38 [kernel].text.unlikely
c000000000879b6c-c000000000bf0000 889b6c [kernel].meminit.text
c000000000bf0000-c000000000c53264 c00000 [kernel].init.text
c000000000c53264-d000000004250000 c63264 [kernel].exit.text
d000000004250000-d000000004450000 0 [libcrc32c]
d000000004450000-d000000004620000 0 [xfs]
d000000004620000-d000000004680000 0 [autofs4]
d000000004680000-d0000000046e0000 0 [x_tables]
d0000000046e0000-d000000004780000 0 [ip_tables]
d000000004780000-d0000000047e0000 0 [rng_core]
d0000000047e0000-ffffffffffffffff 0 [pseries_rng]
Maps in vmlinux with a different name in kallsyms:
Maps only in kallsyms:
d000000000000000-f000000000000000 1000000000010000 [kernel.kallsyms]
f000000000000000-ffffffffffffffff 3000000000010000 [kernel.kallsyms]
test child finished with -1
---- end ----
vmlinux symtab matches kallsyms: FAILED!
The problem is that the kretprobe_trampoline symbol looks like this:
$ eu-readelf -s /boot/vmlinux G kretprobe_trampoline
2431: c000000001302368 24 NOTYPE LOCAL DEFAULT 37 kretprobe_trampoline_holder
2432: c00000000003d300 8 FUNC LOCAL DEFAULT 1 .kretprobe_trampoline_holder
97543: c00000000003d300 0 NOTYPE GLOBAL DEFAULT 1 kretprobe_trampoline
Its type is NOTYPE, and its size is 0, and this is a problem because
symbol-elf.c:dso__load_sym skips function symbols that are not STT_FUNC
or STT_GNU_IFUNC (this is determined by elf_sym__is_function). Even
if the type is changed to STT_FUNC, when dso__load_sym calls
symbols__fixup_duplicate, the kretprobe_trampoline symbol is dropped in
favour of .kretprobe_trampoline_holder because the latter has non-zero
size (as determined by choose_best_symbol).
With this patch, all vmlinux symbols match /proc/kallsyms and the
testcase passes.
Commit c1c355ce14c0 ("x86/kprobes: Get rid of
kretprobe_trampoline_holder()") gets rid of kretprobe_trampoline_holder
altogether on x86. This commit does the same on powerpc. This change
introduces no regressions on the perf and ftracetest testsuite results.
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a guest is assigned to a core it converts the host Timebase (TB)
into guest TB by adding guest timebase offset before entering into
guest. During guest exit it restores the guest TB to host TB. This means
under certain conditions (Guest migration) host TB and guest TB can differ.
When we get an HMI for TB related issues the opal HMI handler would
try fixing errors and restore the correct host TB value. With no guest
running, we don't have any issues. But with guest running on the core
we run into TB corruption issues.
If we get an HMI while in the guest, the current HMI handler invokes opal
hmi handler before forcing guest to exit. The guest exit path subtracts
the guest TB offset from the current TB value which may have already
been restored with host value by opal hmi handler. This leads to incorrect
host and guest TB values.
With split-core, things become more complex. With split-core, TB also gets
split and each subcore gets its own TB register. When a hmi handler fixes
a TB error and restores the TB value, it affects all the TB values of
sibling subcores on the same core. On TB errors all the thread in the core
gets HMI. With existing code, the individual threads call opal hmi handle
independently which can easily throw TB out of sync if we have guest
running on subcores. Hence we will need to co-ordinate with all the
threads before making opal hmi handler call followed by TB resync.
This patch introduces a sibling subcore state structure (shared by all
threads in the core) in paca which holds information about whether sibling
subcores are in Guest mode or host mode. An array in_guest[] of size
MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
The subcore id is used as index into in_guest[] array. Only primary
thread entering/exiting the guest is responsible to set/unset its
designated array element.
On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
this patch will now force guest to vacate the core/subcore. Primary
thread from each subcore will then turn off its respective bit
from the above bitmap during the guest exit path just after the
guest->host partition switch is complete.
All other threads that have just exited the guest OR were already in host
will wait until all other subcores clears their respective bit.
Once all the subcores turn off their respective bit, all threads will
will make call to opal hmi handler.
It is not necessary that opal hmi handler would resync the TB value for
every HMI interrupts. It would do so only for the HMI caused due to
TB errors. For rest, it would not touch TB value. Hence to make things
simpler, primary thread would call TB resync explicitly once for each
core immediately after opal hmi handler instead of subtracting guest
offset from TB. TB resync call will restore the TB with host value.
Thus we can be sure about the TB state.
One of the primary threads exiting the guest will take up the
responsibility of calling TB resync. It will use one of the top bits
(bit 63) from subcore state flags bitmap to make the decision. The first
primary thread (among the subcores) that is able to set the bit will
have to call the TB resync. Rest all other threads will wait until TB
resync is complete. Once TB resync is complete all threads will then
proceed.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
"User" addresses are shown in /sys/devices/pci.../.../resource and
/proc/bus/pci/devices and used as mmap offsets for /proc/bus/pci/BB/DD.F
files. For I/O port resources on powerpc, these are PCI bus addresses,
i.e., raw BAR values.
Previously pci_resource_to_user() computed the user address by subtracting
"hose->io_base_virt - _IO_BASE" from the resource start:
pci_resource_to_user()
if (IO)
offset = (unsigned long)hose->io_base_virt - _IO_BASE;
*start = rsrc->start - offset;
We've already told the PCI core about that "hose->io_base_virt - _IO_BASE"
offset:
pcibios_setup_phb_resources()
res = &hose->io_resource;
offset = pcibios_io_space_offset();
/* i.e., "offset = hose->io_base_virt - _IO_BASE" */
pci_add_resource_offset(resources, res, offset);
so pcibios_resource_to_bus() knows how to do that translation.
No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>