IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has way to store and load contents of L2 or
L2+MRU way of L3 cache". We initiate the storing of the log (list of
addresses) using the logmpp instruction and start restore by writing
to a SPR.
The logmpp instruction takes parameters in a single 64bit register:
- starting address of the table to store log of L2/L2+L3 cache contents
- 32kb for L2
- 128kb for L2+L3
- Aligned relative to maximum size of the table (32kb or 128kb)
- Log control (no-op, L2 only, L2 and L3, abort logout)
We should abort any ongoing logging before initiating one.
To initiate restore, we write to the MPPR SPR. The format of what to write
to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
32 cycles)
The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.
The best case scenario for doing this is when some vcores are doing very
cache heavy workloads. The worst case is when they have about 0 cache hits,
so we just generate needless memory operations.
This implementation just does L2 store/load. In my benchmarks this proves
to be useful.
Benchmark 1:
- 16 core POWER8
- 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
- No split core/SMT
- two guests running sysbench memory test.
sysbench --test=memory --num-threads=8 run
- one guest running apache bench (of default HTML page)
ab -n 490000 -c 400 http://localhost/
This benchmark aims to measure performance of real world application (apache)
where other guests are cache hot with their own workloads. The sysbench memory
benchmark does pointer sized writes to a (small) memory buffer in a loop.
In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times were largely unchanged.
benchmark 2:
- Same VM config as benchmark 1
- all three guests running sysbench memory benchmark
This benchmark aims to see if there is a positive or negative affect to this
cache heavy benchmark. Although due to the nature of the benchmark (stores) we
may not see a difference in performance, but rather hopefully an improvement
in consistency of performance (when vcore switched in, don't have to wait
many times for cachelines to be pulled in)
The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.
benchmark 3:
- same 3 guests and CPU configuration as benchmark 1 and 2.
- two idle guests
- 1 guest running STREAM benchmark
This scenario also saw performance improvement with this patch. On Copy and
Scale workloads from STREAM, I got 5-6% improvement with this patch. For
Add and triad, it was around 10% (or more).
benchmark 4:
- same 3 guests as previous benchmarks
- two guests running sysbench --memory, distinctly different cache heavy
workload
- one guest running STREAM benchmark.
Similar improvements to benchmark 3.
benchmark 5:
- 1 guest, 8 VCPUs, Ubuntu 14.04
- Host configured with split core (SMT8, subcores-per-core=4)
- STREAM benchmark
In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.
Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
No code changes, just split it out to a function so that with the addition
of micro partition prefetch buffer allocation (in subsequent patch) looks
neater and doesn't require excessive indentation.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns
any error indication, it returns -ENOENT, which is taken to mean an
HPTE not found error. However, the error could have been a segment
found (no SLB entry) or a permission error. Similarly,
kvmppc_pte_to_hva currently does permission checking, but any error
from it is taken by kvmppc_ld to mean that the access is an emulated
MMIO access. Also, kvmppc_ld does no execute permission checking.
This fixes these problems by (a) returning any error from kvmppc_xlate
directly, (b) moving the permission check from kvmppc_pte_to_hva
into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld.
This is similar to what was done for kvmppc_st() by commit 82ff911317c3
("KVM: PPC: Deflect page write faults properly in kvmppc_st").
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().
It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
Unfortunately, the LPCR got defined as a 32-bit register in the
one_reg interface. This is unfortunate because KVM allows userspace
to control the DPFD (default prefetch depth) field, which is in the
upper 32 bits. The result is that DPFD always get set to 0, which
reduces performance in the guest.
We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
since that would break existing userspace binaries. Instead we define
a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use
the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in
the bottom 32 bits that userspace can modify (ILE, TC and AIL).
If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
as well.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
The 440 target hasn't been properly functioning for a few releases and
before I was the only one who fixes a very serious bug that indicates to
me that nobody used it before either.
Furthermore KVM on 440 is slow to the extent of unusable.
We don't have to carry along completely unused code. Remove 440 and give
us one less thing to worry about.
Signed-off-by: Alexander Graf <agraf@suse.de>
We now support SPRG9 for guest, so also add a one reg interface for same
Note: Changes are in bookehv code only as we do not have SPRG9 on booke-pr.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
SPRN_SPRG is used by debug interrupt handler, so this is required for
debug support.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
On book3e, KVM uses load external pid (lwepx) dedicated instruction to read
guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI
and LRAT), generated by loading a guest address, needs to be handled by KVM.
These exceptions are generated in a substituted guest translation context
(EPLC[EGS] = 1) from host context (MSR[GS] = 0).
Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1),
doing minimal checks on the fast path to avoid host performance degradation.
lwepx exceptions originate from host state (MSR[GS] = 0) which implies
additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking
at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context
Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious
too intrusive for the host.
Read guest last instruction from kvmppc_load_last_inst() by searching for the
physical address and kmap it. This address the TODO for TLB eviction and
execute-but-not-read entries, and allow us to get rid of lwepx until we are
able to handle failures.
A simple stress benchmark shows a 1% sys performance degradation compared with
previous approach (lwepx without failure handling):
time for i in `seq 1 10000`; do /bin/echo > /dev/null; done
real 0m 8.85s
user 0m 4.34s
sys 0m 4.48s
vs
real 0m 8.84s
user 0m 4.36s
sys 0m 4.44s
A solution to use lwepx and to handle its exceptions in KVM would be to temporary
highjack the interrupt vector from host. This imposes additional synchronizations
for cores like FSL e6500 that shares host IVOR registers between hardware threads.
This optimized solution can be later developed on top of this patch.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
On book3e, guest last instruction is read on the exit path using load
external pid (lwepx) dedicated instruction. This load operation may fail
due to TLB eviction and execute-but-not-read entries.
This patch lay down the path for an alternative solution to read the guest
last instruction, by allowing kvmppc_get_lat_inst() function to fail.
Architecture specific implmentations of kvmppc_load_last_inst() may read
last guest instruction and instruct the emulation layer to re-execute the
guest in case of failure.
Make kvmppc_get_last_inst() definition common between architectures.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
In the context of replacing kvmppc_ld() function calls with a version of
kvmppc_get_last_inst() which allow to fail, Alex Graf suggested this:
"If we get EMULATE_AGAIN, we just have to make sure we go back into the guest.
No need to inject an ISI into the guest - it'll do that all by itself.
With an error returning kvmppc_get_last_inst we can just use completely
get rid of kvmppc_read_inst() and only use kvmppc_get_last_inst() instead."
As a intermediate step get rid of kvmppc_read_inst() and only use kvmppc_ld()
instead.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The commit 1d628af7 "add load inst fixup" made an attempt to handle
failures generated by reading the guest current instruction. The fixup
code that was added works by chance hiding the real issue.
Load external pid (lwepx) instruction, used by KVM to read guest
instructions, is executed in a subsituted guest translation context
(EPLC[EGS] = 1). In consequence lwepx's TLB error and data storage
interrupts need to be handled by KVM, even though these interrupts
are generated from host context (MSR[GS] = 0) where lwepx is executed.
Currently, KVM hooks only interrupts generated from guest context
(MSR[GS] = 1), doing minimal checks on the fast path to avoid host
performance degradation. As a result, the host kernel handles lwepx
faults searching the faulting guest data address (loaded in DEAR) in
its own Logical Partition ID (LPID) 0 context. In case a host translation
is found the execution returns to the lwepx instruction instead of the
fixup, the host ending up in an infinite loop.
Revert the commit "add load inst fixup". lwepx issue will be addressed
in a subsequent patch without needing fixup code.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
kvmppc_set_epr() is already defined in asm/kvm_ppc.h, So
rename and move get_epr helper function to same file.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
[agraf: remove duplicate return]
Signed-off-by: Alexander Graf <agraf@suse.de>
Use kvmppc_set_sprg[0-7]() and kvmppc_get_sprg[0-7]() helper
functions
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Add and use kvmppc_set_esr() and kvmppc_get_esr() helper functions
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Use kvmppc_set_srr0/srr1() and kvmppc_get_srr0/srr1() helper functions
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The magic page is defined as a 4k page of per-vCPU data that is shared
between the guest and the host to accelerate accesses to privileged
registers.
However, when the host is using 64k page size granularity we weren't quite
as strict about that rule anymore. Instead, we partially treated all of the
upper 64k as magic page and mapped only the uppermost 4k with the actual
magic contents.
This works well enough for Linux which doesn't use any memory in kernel
space in the upper 64k, but Mac OS X got upset. So this patch makes magic
page actually stay in a 4k range even on 64k page size hosts.
This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
hosts for me.
Signed-off-by: Alexander Graf <agraf@suse.de>
Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.
This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
instruction pointer and can map the user space process freely below there.
However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.
It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.
The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.
So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:
Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).
That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).
Signed-off-by: Alexander Graf <agraf@suse.de>
When a page lookup failed because we're not allowed to write to the page, we
should not overwrite that value with another lookup on the second PTEG which
will return "page not found". Instead, we should just tell the caller that we
had a permission problem.
This fixes Mac OS X guests looping endlessly in page lookup code for me.
Signed-off-by: Alexander Graf <agraf@suse.de>
When we have a page that we're not allowed to write to, xlate() will already
tell us -EPERM on lookup of that page. With the code as is we change it into
a "page missing" error which a guest may get confused about. Instead, just
tell the caller about the -EPERM directly.
This fixes Mac OS X guests when run with DCBZ32 emulation.
Signed-off-by: Alexander Graf <agraf@suse.de>
For FSL e6500 core the kernel uses power management SPR register (PWRMGTCR0)
to enable idle power down for cores and devices by setting up the idle count
period at boot time. With the host already controlling the power management
configuration the guest could simply benefit from it, so emulate guest request
as a general store.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Now that we've fixed all the issues that HV KVM code had on little endian
hosts, we can enable it in the kernel configuration for users to play with.
Signed-off-by: Alexander Graf <agraf@suse.de>
For code that doesn't live in modules we can just branch to the real function
names, giving us compatibility with ABIv1 and ABIv2.
Do this for the compiled-in code of HV KVM.
Signed-off-by: Alexander Graf <agraf@suse.de>
On the exit path from the guest we check what type of interrupt we received
if we received one. This means we're doing hardware access to the XICS interrupt
controller.
However, when running on a little endian system, this access is byte reversed.
So let's make sure to swizzle the bytes back again and virtually make XICS
accesses big endian.
Signed-off-by: Alexander Graf <agraf@suse.de>
Some data structures are always stored in big endian. Among those are the LPPACA
fields as well as the shadow slb. These structures might be shared with a
hypervisor.
So whenever we access those fields, make sure we do so in big endian byte order.
Signed-off-by: Alexander Graf <agraf@suse.de>
There are a few shared data structures between the host and the guest. Most
of them get registered through the VPA interface.
These data structures are defined to always be in big endian byte order, so
let's make sure we always access them in big endian.
Signed-off-by: Alexander Graf <agraf@suse.de>
When running on an LE host all data structures are kept in little endian
byte order. However, the HTAB still needs to be maintained in big endian.
So every time we access any HTAB we need to make sure we do so in the right
byte order. Fix up all accesses to manually byte swap.
Signed-off-by: Alexander Graf <agraf@suse.de>
Tlb search operation used for victim hint relies on the default tlb set by the
host. When hardware tablewalk support is enabled in the host, the default tlb is
TLB1 which leads KVM to evict the bolted entry. Set and restore the default tlb
when searching for victim hint.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds support for the H_SET_MODE hcall. This hcall is a
multiplexer that has several functions, some of which are called
rarely, and some which are potentially called very frequently.
Here we add support for the functions that set the debug registers
CIABR (Completed Instruction Address Breakpoint Register) and
DAWR/DAWRX (Data Address Watchpoint Register and eXtension),
since they could be updated by the guest as often as every context
switch.
This also adds a kvmppc_power8_compatible() function to test to see
if a guest is compatible with POWER8 or not. The CIABR and DAWR/X
only exist on POWER8.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.
This also checks the default-enabled list of hcalls and prints a
warning if any hcall there is not actually implemented.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This provides a way for userspace controls which sPAPR hcalls get
handled in the kernel. Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS. The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.
Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
The args field of the struct kvm_enable_cap specifies the hcall number
in args[0] and the enable/disable flag in args[1]; 0 means disable
in-kernel handling (so that the hcall will always cause an exit to
userspace) and 1 means enable. Enabling or disabling in-kernel
handling of an hcall is effective across the whole VM.
The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM
capability advertises that this ability exists.
When a VM is created, an initial set of hcalls are enabled for
in-kernel handling. The set that is enabled is the set that have
an in-kernel implementation at this point. Any new hcall
implementations from this point onwards should not be added to the
default set without a good reason.
No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On vcpu schedule, the condition checked for tlb pollution is too loose.
The tlb entries of a vcpu become polluted (vs stale) only when a different
vcpu within the same logical partition runs in-between. Optimize the tlb
invalidation condition keeping last_vcpu per logical partition id.
With the new invalidation condition, a guest shows 4% performance improvement
on P5020DS while running a memory stress application with the cpu oversubscribed,
the other guest running a cpu intensive workload.
Guest - old invalidation condition
real 3.89
user 3.87
sys 0.01
Guest - enhanced invalidation condition
real 3.75
user 3.73
sys 0.01
Host
real 3.70
user 1.85
sys 0.00
The memory stress application accesses 4KB pages backed by 75% of available
TLB0 entries:
char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));
int main()
{
char bar;
int i, j;
for (i = 0; i < ITERATIONS; i++)
for (j = 0; j < ENTRIES; j++)
bar = foo[j][0];
return 0;
}
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
While sending sparse with endian checks over the code base, it triggered at
some places that were missing casts or had wrong types. Fix them up.
Signed-off-by: Alexander Graf <agraf@suse.de>
We switched to ABIv2 on Little Endian systems now which gets rid of the
dotted function names. Branch to the actual functions when we see such
a system.
Signed-off-by: Alexander Graf <agraf@suse.de>
Both kvmppc_hv_entry_trampoline and kvmppc_entry_trampoline are
assembly functions that are exported to modules and also require
a valid r2.
As such we need to use _GLOBAL_TOC so we provide a global entry
point that establishes the TOC (r2).
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
To establish addressability quickly, ABIv2 requires the target
address of the function being called to be in r12.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
If we're running PR KVM in HV mode, we may get hypervisor doorbell interrupts.
Handle those the same way we treat normal doorbells.
Signed-off-by: Alexander Graf <agraf@suse.de>
When we're using PR KVM we must not allow the CPU to take interrupts
in virtual mode, as the SLB does not contain host kernel mappings
when running inside the guest context.
To make sure we get good performance for non-KVM tasks but still
properly functioning PR KVM, let's just disable AIL whenever a vcpu
is scheduled in.
This is fundamentally different from how we deal with AIL on pSeries
type machines where we disable AIL for the whole machine as soon as
a single KVM VM is up.
The reason for that is easy - on pSeries we do not have control over
per-cpu configuration of AIL. We also don't want to mess with CPU hotplug
races and AIL configuration, so setting it per CPU is easier and more
flexible.
This patch fixes running PR KVM on POWER8 bare metal for me.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Writing to IC is not allowed in the privileged mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
virtual time base register is a per VM, per cpu register that needs
to be saved and restored on vm exit and entry. Writing to VTB is not
allowed in the privileged mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix compile error]
Signed-off-by: Alexander Graf <agraf@suse.de>
We use time base for PURR and SPURR emulation with PR KVM since we
are emulating a single threaded core. When using time base
we need to make sure that we don't accumulate time spent in the host
in PURR and SPURR value.
Also we don't need to emulate mtspr because both the registers are
hypervisor resource.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently we forward MCEs to guest which have been recovered by guest.
And for unhandled errors we do not deliver the MCE to guest. It looks like
with no support of FWNMI in qemu, guest just panics whenever we deliver the
recovered MCEs to guest. Also, the existig code used to return to host for
unhandled errors which was casuing guest to hang with soft lockups inside
guest and makes it difficult to recover guest instance.
This patch now forwards all fatal MCEs to guest causing guest to crash/panic.
And, for recovered errors we just go back to normal functioning of guest
instead of returning to host. This fixes soft lockup issues in guest.
This patch also fixes an issue where guest MCE events were not logged to
host console.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc updates from Ben Herrenschmidt:
"Here is the bulk of the powerpc changes for this merge window. It got
a bit delayed in part because I wasn't paying attention, and in part
because I discovered I had a core PCI change without a PCI maintainer
ack in it. Bjorn eventually agreed it was ok to merge it though we'll
probably improve it later and I didn't want to rebase to add his ack.
There is going to be a bit more next week, essentially fixes that I
still want to sort through and test.
The biggest item this time is the support to build the ppc64 LE kernel
with our new v2 ABI. We previously supported v2 userspace but the
kernel itself was a tougher nut to crack. This is now sorted mostly
thanks to Anton and Rusty.
We also have a fairly big series from Cedric that add support for
64-bit LE zImage boot wrapper. This was made harder by the fact that
traditionally our zImage wrapper was always 32-bit, but our new LE
toolchains don't really support 32-bit anymore (it's somewhat there
but not really "supported") so we didn't want to rely on it. This
meant more churn that just endian fixes.
This brings some more LE bits as well, such as the ability to run in
LE mode without a hypervisor (ie. under OPAL firmware) by doing the
right OPAL call to reinitialize the CPU to take HV interrupts in the
right mode and the usual pile of endian fixes.
There's another series from Gavin adding EEH improvements (one day we
*will* have a release with less than 20 EEH patches, I promise!).
Another highlight is the support for the "Split core" functionality on
P8 by Michael. This allows a P8 core to be split into "sub cores" of
4 threads which allows the subcores to run different guests under KVM
(the HW still doesn't support a partition per thread).
And then the usual misc bits and fixes ..."
[ Further delayed by gmail deciding that BenH is a dirty spammer.
Google knows. ]
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits)
powerpc/powernv: Add missing include to LPC code
selftests/powerpc: Test the THP bug we fixed in the previous commit
powerpc/mm: Check paca psize is up to date for huge mappings
powerpc/powernv: Pass buffer size to OPAL validate flash call
powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC()
powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC()
powerpc/powernv: Set memory_block_size_bytes to 256MB
powerpc: Allow ppc_md platform hook to override memory_block_size_bytes
powerpc/powernv: Fix endian issues in memory error handling code
powerpc/eeh: Skip eeh sysfs when eeh is disabled
powerpc: 64bit sendfile is capped at 2GB
powerpc/powernv: Provide debugfs access to the LPC bus via OPAL
powerpc/serial: Use saner flags when creating legacy ports
powerpc: Add cpu family documentation
powerpc/xmon: Fix up xmon format strings
powerpc/powernv: Add calls to support little endian host
powerpc: Document sysfs DSCR interface
powerpc: Fix regression of per-CPU DSCR setting
powerpc: Split __SYSFS_SPRSETUP macro
arch: powerpc/fadump: Cleaning up inconsistent NULL checks
...
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration,
GDB support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace interface
and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still,
we have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
bAknaZpPiOrW11Et1htY
=j5Od
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
Pull KVM updates from Paolo Bonzini:
"At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration, GDB
support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by
Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace
interface and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still, we
have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
KVM: add missing cleanup_srcu_struct
KVM: PPC: Book3S PR: Rework SLB switching code
KVM: PPC: Book3S PR: Use SLB entry 0
KVM: PPC: Book3S HV: Fix machine check delivery to guest
KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
KVM: PPC: Book3S HV: Fix dirty map for hugepages
KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
KVM: PPC: Book3S: Add ONE_REG register names that were missed
KVM: PPC: Add CAP to indicate hcall fixes
KVM: PPC: MPIC: Reset IRQ source private members
KVM: PPC: Graciously fail broken LE hypercalls
PPC: ePAPR: Fix hypercall on LE guest
KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
KVM: PPC: BOOK3S: Always use the saved DAR value
PPC: KVM: Make NX bit available with magic page
KVM: PPC: Disable NX for old magic page using guests
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
...
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.
Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.
Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.
While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.
Signed-off-by: Alexander Graf <agraf@suse.de>
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.
Just enable use of SLB entry 0 for our shadow SLB code.
Signed-off-by: Alexander Graf <agraf@suse.de>
The code that delivered a machine check to the guest after handling
it in real mode failed to load up r11 before calling kvmppc_msr_interrupt,
which needs the old MSR value in r11 so it can see the transactional
state there. This adds the missing load.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation. The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.
The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't. The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.
The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear). The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>