IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
some of which are valid (ie, SLB_ESID_V is set) and the rest are
likely all-zeroes (with QEMU at least).
Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
assumes to find the SLB index in the 3 lower bits of its rb argument.
When passed zeroed arguments, it happily overwrites the 0th SLB entry
with zeroes. This is exactly what happens while doing live migration
with QEMU when the destination pushes the incoming SLB descriptors to
KVM PR. When reloading the SLBs at the next synchronization, QEMU first
clears its SLB array and only restore valid ones, but the 0th one is
now gone and we cannot access the corresponding memory anymore:
(qemu) x/x $pc
c0000000000b742c: Cannot access memory
To avoid this, let's filter out non-valid SLB entries. While here, we
also force a full SLB flush before installing new entries. Since SLB
is for 64-bit only, we now build this path conditionally to avoid a
build break on 32-bit, which doesn't define SLB_ESID_V.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When running a guest on a POWER9 system with the in-kernel XICS
emulation disabled (for example by running QEMU with the parameter
"-machine pseries,kernel_irqchip=off"), the kernel does not pass
the XICS-related hypercalls such as H_CPPR up to userspace for
emulation there as it should.
The reason for this is that the real-mode handlers for these
hypercalls don't check whether a XICS device has been instantiated
before calling the xics-on-xive code. That code doesn't check
either, leading to potential NULL pointer dereferences because
vcpu->arch.xive_vcpu is NULL. Those dereferences won't cause an
exception in real mode but will lead to kernel memory corruption.
This fixes it by adding kvmppc_xics_enabled() checks before calling
the XICS functions.
Cc: stable@vger.kernel.org # v4.11+
Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:
@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@
module_param_call(_name, _set_func, _get_func, _arg, _mode);
@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:
drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Fixes: 424de9c6e3f8 ("powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 07d2a628bc00 ("powerpc/64s: Avoid cpabort in context switch
when possible", 2017-06-09) changed the definition of PPC_INST_COPY
and in so doing inadvertently broke the check for copy/paste
instructions in the alignment fault handler. The check currently
matches no instructions.
This fixes it by ANDing both sides of the comparison with the mask.
Fixes: 07d2a628bc00 ("powerpc/64s: Avoid cpabort in context switch when possible")
Cc: stable@vger.kernel.org # v4.13+
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When setting nr_cpus=1, we observed a crash in IMC code during boot
due to a missing allocation: basically, IMC code is taking the number
of threads into account in imc_mem_init() and if we manually set
nr_cpus for a value that is not multiple of the number of threads per
core, an integer division in that function will discard the decimal
portion, leading IMC to not allocate one mem_info struct. This causes
a NULL pointer dereference later, on is_core_imc_mem_inited().
This patch just rounds that division up, fixing the bug.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although kfree(NULL) is legal, it's a bit lazy to rely on that to
implement the error handling. So do it the normal Linux way using
labels for each failure path.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Squash a few patches and rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The local variable "rc" will eventually be set only to an error code.
Thus omit the explicit initialisation at the beginning.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the hv-24x7 code there is a function memord() which tries to
implement a sort function return -1, 0, 1. However one of the
conditions is incorrect, such that it can never be true, because we
will have already returned.
I don't believe there is a bug in practice though, because the
comparisons are an optimisation prior to calling memcmp().
Fix it by swapping the second comparision, so it can be true.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 2008 we added support for "fast little-endian switch" in the
syscall path. This added a special case syscall number 0x1ebe, which
is caught very early in the system call exception and switches endian
with as little overhead as possible. See commit 745a14cc264b
("[POWERPC] Add fast little-endian switch system call") for full
details.
Although it is fast, it's also completely non standard. The "syscall
number" is out of the range of normal syscalls, it can't be traced or
audited, and it's a bit of a wart. To the best of our knowledge it was
only used by one program, now long since discontinued.
So in an effort to shake out any current users, put it behind a config
option, and make it default n. If anyone *is* using it they can
quickly reinstate it with a rebuild, and we can flip it to default y.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When dumping the paca in xmon we currently show kstack. Although it's
not hard it's a bit fiddly to work out what the bounds of the kernel
stack should be based on the kstack value.
To make life easier and "kstack_base" which is the base (lowest
address) of the kernel stack, eg:
kstack = 0xc0000000f1a7be30 (0x258)
kstack_base = 0xc0000000f1a78000
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
i2c-dev provides an interface for userspace programs to interact with I2C
devices, and is very helpful for I2C-related debugging.
Enable it in pseries_defconfig and powernv_defconfig. It's already enabled
in many other powerpc defconfigs.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We call these functions with non-NULL mm or vma. Hence we can skip the
NULL check in these functions. We also remove now unused function
__local_flush_hugetlb_page().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Drop the checks with is_vm_hugetlb_page() as noticed by Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently xmon could call XIVE functions from OPAL even if the XIVE is
disabled or does not exist in the system, as in POWER8 machines. This
causes the following exception:
1:mon> dx
cpu 0x1: Vector: 700 (Program Check) at [c000000423c93450]
pc: c00000000009cfa4: opal_xive_dump+0x50/0x68
lr: c0000000000997b8: opal_return+0x0/0x50
This patch simply checks if XIVE is enabled before calling XIVE
functions.
Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Suggested-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unfortunately userspace can construct a sigcontext which enables
suspend. Thus userspace can force Linux into a path where trechkpt is
executed.
This patch blocks this from happening on POWER9 by sanity checking
sigcontexts passed in.
ptrace doesn't have this problem as only MSR SE and BE can be changed
via ptrace.
This patch also adds a number of WARN_ON()s in case we ever enter
suspend when we shouldn't. This should not happen, but if it does the
symptoms are soft lockup warnings which are not obviously TM related,
so the WARN_ON()s should make it obvious what's happening.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some Power9 revisions can run in a mode where TM operates without
suspended state. If we find ourself on a CPU that might be in this
mode, we query OPAL to check, and if so we reenable TM in CPU
features, and enable a new user feature to signal to userspace that we
are in this mode.
We do not enable the "normal" user feature, PPC_FEATURE2_HTM, but we
do enable PPC_FEATURE2_HTM_NOSC because that indicates to userspace
that the kernel will abort transactions on syscall entry, which is
true regardless of the suspend mode.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some CPUs can operate in a mode where TM (Transactional Memory) is
enabled but the suspended state of TM is disabled. In this mode
tsuspend does not enter suspended state, instead the transaction is
aborted. Similarly any other event that would lead to suspended state
instead aborts the transaction.
There is also an ABI change, in that in this mode processes are not
allowed to sigreturn with an MSR that would lead to suspended state,
Linux will instead return an error to the sigreturn syscall.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the kernel relies on firmware to inform it whether or not the
CPU supports HTM and as long as the kernel was built with
CONFIG_PPC_TRANSACTIONAL_MEM=y then it will allow userspace to make
use of the facility.
There may be situations where it would be advantageous for the kernel
to not allow userspace to use HTM, currently the only way to achieve
this is to recompile the kernel with CONFIG_PPC_TRANSACTIONAL_MEM=n.
This patch adds a simple commandline option so that HTM can be
disabled at boot time.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Simplify to a bool, move to prom.c, put doco in the right place.
Always disable, regardless of initial state, to avoid user confusion.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we use CPU_FTR_TM to decide if the CPU/kernel can support
TM (Transactional Memory), and if it's true we advertise that to
Qemu (or similar) via KVM_CAP_PPC_HTM.
PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that
the CPU and kernel can support TM. Currently CPU_FTR_TM and
PPC_FEATURE2_HTM always have the same value, either true or false, so
using the former for KVM_CAP_PPC_HTM is correct.
However some Power9 CPUs can operate in a mode where TM is enabled but
TM suspended state is disabled. In this mode CPU_FTR_TM is true, but
PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is
set, to indicate that this different mode of TM is available.
It is not safe to let guests use TM as-is, when the CPU is in this
mode. So to prevent that from happening, use PPC_FEATURE2_HTM to
determine the value of KVM_CAP_PPC_HTM.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
After we removed all the dead wood it turns out only two architectures
actually implement dma_cache_sync as a real op: mips and parisc. Add
a cache_sync method to struct dma_map_ops and implement it for the
mips defualt DMA ops, and the parisc pa11 ops.
Note that arm, arc and openrisc support DMA_ATTR_NON_CONSISTENT, but
never provided a functional dma_cache_sync implementations, which
seems somewhat odd.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Only mips defines this helper, so remove all the other arch definitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
This reverts commit 94a04bc25a2c6296bd0c5e82c10e8231c2b11f77.
In order to run HPT guests on a radix POWER9 host, we will have to run
the host in single-threaded mode, because POWER9 processors do not
currently support running some threads of a core in HPT mode while
others are in radix mode ("mixed mode").
That means that we will need the same mechanisms that are used on
POWER8 to make the secondary threads available to KVM, which were
disabled on POWER9 by commit 94a04bc25a2c.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch_spin_lock_flags() is an internal part of the spinlock implementation
and is no longer available when SMP=n and DEBUG_SPINLOCK=y, so the PPC
RTAS code fails to compile in this configuration:
arch/powerpc/kernel/rtas.c: In function 'lock_rtas':
>> arch/powerpc/kernel/rtas.c:81:2: error: implicit declaration of function 'arch_spin_lock_flags' [-Werror=implicit-function-declaration]
arch_spin_lock_flags(&rtas.lock, flags);
^~~~~~~~~~~~~~~~~~~~
Since there's no good reason to use arch_spin_lock_flags() here (the code
in question already calls local_irq_save(flags)), switch it over to
arch_spin_lock and get things building again.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508327469-20231-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Axtens reported that on the HiSilicon D05 board, the VGA device is
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so the VGA arbiter
never selects it as the default, which means Xorg auto-detection doesn't
work.
VGA is a legacy PCI feature: a VGA device can respond to addresses, e.g.,
[mem 0xa0000-0xbffff], [io 0x3b0-0x3bb], [io 0x3c0-0x3df], etc., that are
not configurable by BARs. Consequently, multiple VGA devices can conflict
with each other. The VGA arbiter avoids conflicts by ensuring that those
legacy resources are only routed to one VGA device at a time.
The arbiter identifies the "default VGA" device, i.e., a legacy VGA device
that was used by boot firmware. It selects the first device that:
- is of PCI_CLASS_DISPLAY_VGA,
- has both PCI_COMMAND_IO and PCI_COMMAND_MEMORY enabled, and
- has PCI_BRIDGE_CTL_VGA set in all upstream bridges.
Some systems don't have such a device. For example, if a host bridge
doesn't support I/O space, PCI_COMMAND_IO probably won't be enabled for any
devices below it. Or, as on the HiSilicon D05, the VGA device may be
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so accesses to the
legacy VGA resources will never reach the device.
This patch extends the arbiter so that if it doesn't find a device that
meets all the above criteria, it selects the first device that:
- is of PCI_CLASS_DISPLAY_VGA and
- has PCI_COMMAND_IO or PCI_COMMAND_MEMORY enabled
If it doesn't find even that, it selects the first device that:
- is of class PCI_CLASS_DISPLAY_VGA.
Such a device may not be able to use the legacy VGA resources, but most
drivers can operate the device without those. Setting it as the default
device means its "boot_vga" sysfs file will contain "1", which Xorg (via
libpciaccess) uses to help select its default output device.
This fixes Xorg auto-detection on some arm64 systems (HiSilicon D05 in
particular; see the link below).
It also replaces the powerpc fixup_vga() quirk, albeit with slightly
different semantics: the quirk selected the first VGA device we found, and
overrode that selection with any enabled VGA device we found. If there
were several enabled VGA devices, the *last* one we found would become the
default.
The code here instead selects the *first* enabled VGA device we find, and
if none are enabled, the first VGA device we find.
Link: http://lkml.kernel.org/r/20170901072744.2409-1-dja@axtens.net
Tested-by: Daniel Axtens <dja@axtens.net> # arm64, ppc64-qemu-tcg
Tested-by: Zhou Wang <wangzhou1@hisilicon.com> # D05 Hisi Hip07, Hip08
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20171013034721.14630.65913.stgit@bhelgaas-glaptop.roam.corp.google.com
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
fixes an end-of-updates processing problem observed occasionally
in numa_update_cpu_topology().
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/hotplug: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. During hotplug
CPU operations, this patch resets the timer on topology update work
function to a small value to better ensure that the CPU topology is
detected and configured sooner.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
updates the initialization checks to independently recognize PRRN
or VPHN support.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
corrects the currently broken capability to set the topology for
shared CPUs in LPARs. At boot time for shared CPU lpars, the
topology for each CPU was being set to node zero. Now when
numa_update_cpu_topology() is called appropriately, the Virtual
Processor Home Node (VPHN) capabilities information provided by the
pHyp allows the appropriate node in the shared configuration to be
selected for the CPU.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.
We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.
We could probably poison the struct page to avoid a race
between detection and taking corrective action.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().
We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the same alignment as Effective address.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no users of get_mce_fault_addr() since commit
1363875bdb63 ("powerpc/64s: fix handling of non-synchronous machine
checks") removed the last usage.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9 systems, we push the VCPU context onto the XIVE (eXternal
Interrupt Virtualization Engine) hardware when entering a guest,
and pull the context off the XIVE when exiting the guest. The push
is done with cache-inhibited stores, and the pull with cache-inhibited
loads.
Testing has revealed that it is possible (though very rare) for
the stores to get reordered with the loads so that we end up with the
guest VCPU context still loaded on the XIVE after we have exited the
guest. When that happens, it is possible for the same VCPU context
to then get loaded on another CPU, which causes the machine to
checkstop.
To fix this, we add I/O barrier instructions (eieio) before and
after the push and pull operations. As partial compensation for the
potential slowdown caused by the extra barriers, we remove the eieio
instructions between the two stores in the push operation, and between
the two loads in the pull operation. (The architecture requires
loads to cache-inhibited, guarded storage to be kept in order, and
requires stores to cache-inhibited, guarded storage likewise to be
kept in order, but allows such loads and stores to be reordered with
respect to each other.)
Reported-by: Carol L Soto <clsoto@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to make sure that we don't try to access the
non-existent HPT for a radix guest using the htab file for the VM
in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD
ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls.
At present nothing bad happens if userspace does access these
interfaces on a radix guest, mostly because kvmppc_hpt_npte()
gives 0 for a radix guest, which in turn is because 1 << -4
comes out as 0 on POWER processors. However, that relies on
undefined behaviour, so it is better to be explicit about not
accessing the HPT for a radix guest.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The handlers support PR KVM from the day one; however the PR KVM's
enable/disable hcalls handler missed these ones.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Use vma_pages function on vma object instead of explicit computation.
Found by coccinelle spatch "api/vma_pages.cocci"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
yourself.
Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)
/ARRAY_SIZE(\1)/g' and manual check/verification.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present, if an interrupt (i.e. an exception or trap) occurs in the
code where KVM is switching the MMU to or from guest context, we jump
to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
In this situation, it is hard to debug what happened because we get no
indication as to which interrupt occurred or where. Typically we get
a cascade of stall and soft lockup warnings from other CPUs.
In order to get more information for debugging, this adds code to
create a stack frame on the emergency stack and save register values
to it. We start half-way down the emergency stack in order to give
ourselves some chance of being able to do a stack trace on secondary
threads that are already on the emergency stack.
On POWER7 or POWER8, we then just spin, as before, because we don't
know what state the MMU context is in or what other threads are doing,
and we can't switch back to host context without coordinating with
other threads. On POWER9 we can do better; there we load up the host
MMU context and jump to C code, which prints an oops message to the
console and panics.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
kvmppc_gpa_to_ua() accesses KVM memory slot array via
srcu_dereference_check() and this produces warnings from RCU like below.
This extends the existing srcu_read_lock/unlock to cover that
kvmppc_gpa_to_ua() as well.
We did not hit this before as this lock is not needed for the realmode
handlers and hash guests would use the realmode path all the time;
however the radix guests are always redirected to the virtual mode
handlers and hence the warning.
[ 68.253798] ./include/linux/kvm_host.h:575 suspicious rcu_dereference_check() usage!
[ 68.253799]
other info that might help us debug this:
[ 68.253802]
rcu_scheduler_active = 2, debug_locks = 1
[ 68.253804] 1 lock held by qemu-system-ppc/6413:
[ 68.253806] #0: (&vcpu->mutex){+.+.}, at: [<c00800000e3c22f4>] vcpu_load+0x3c/0xc0 [kvm]
[ 68.253826]
stack backtrace:
[ 68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: G W 4.14.0-rc3-00553-g432dcba58e9c-dirty #72
[ 68.253833] Call Trace:
[ 68.253839] [c000000fd3d9f790] [c000000000b7fcc8] dump_stack+0xe8/0x160 (unreliable)
[ 68.253845] [c000000fd3d9f7d0] [c0000000001924c0] lockdep_rcu_suspicious+0x110/0x180
[ 68.253851] [c000000fd3d9f850] [c0000000000e825c] kvmppc_gpa_to_ua+0x26c/0x2b0
[ 68.253858] [c000000fd3d9f8b0] [c00800000e3e1984] kvmppc_h_put_tce+0x12c/0x2a0 [kvm]
Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- Add another case where msgsync is required.
- Required barrier sequence for global doorbells is msgsync ; lwsync
When msgsnd is used for IPIs to other cores, msgsync must be executed by
the target to order stores performed on the source before its msgsnd
(provided the source executes the appropriate sync).
Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>