IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 3c4d296e58 upstream.
MMC3 hwmod data is missing the module_offs definition. MMC3 belongs under
core, so add CORE_MOD for it.
Signed-off-by: Tero Kristo <t-kristo@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Cc: Adam Ford <aford173@gmail.com>
Fixes: 6c0afb5039 ("clk: ti: convert to use proper register definition for all accesses")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 45d55e7bac upstream.
Keith reported the following warning:
WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
x86_vector_free_irqs+0xa1/0x180
x86_vector_alloc_irqs+0x1e4/0x3a0
msi_domain_alloc+0x62/0x130
The reason for this is that if the vector allocation fails the error
handling code tries to free the failed vector as well, which causes the
above imbalance warning to trigger.
Adjust the error path to handle this correctly.
Fixes: b5dc8e6c21 ("x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors")
Reported-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Keith Busch <keith.busch@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801161217300.1823@nanos
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b511203093 upstream.
The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is
problematic:
- SKX workstations (with same model # as server variants) use a 24 MHz
crystal. This results in a -4.0% time drift rate on SKX workstations.
- SKX servers subject the crystal to an EMI reduction circuit that reduces its
actual frequency by (approximately) -0.25%. This results in -1 second per
10 minute time drift as compared to network time.
This issue can also trigger a timer and power problem, on configurations
that use the LAPIC timer (versus the TSC deadline timer). Clock ticks
scheduled with the LAPIC timer arrive a few usec before the time they are
expected (according to the slow TSC). This causes Linux to poll-idle, when
it should be in an idle power saving state. The idle and clock code do not
graciously recover from this error, sometimes resulting in significant
polling and measurable power impact.
Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X.
native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz,
and the TSC refined calibration will update tsc_khz to correct for the
difference.
[ tglx: Sanitized change log ]
Fixes: 6baf3d6182 ("x86/tsc: Add additional Intel CPU models to the crystal quirk list")
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: Prarit Bhargava <prarit@redhat.com>
Link: https://lkml.kernel.org/r/ff6dcea166e8ff8f2f6a03c17beab2cb436aa779.1513920414.git.len.brown@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit da4ae6c4a0 upstream.
If the crystal frequency cannot be determined via CPUID(15).crystal_khz or
the built-in table then native_calibrate_tsc() will still set the
X86_FEATURE_TSC_KNOWN_FREQ flag which prevents the refined TSC calibration.
As a consequence such systems use cpu_khz for the TSC frequency which is
incorrect when cpu_khz != tsc_khz resulting in time drift.
Return early when the crystal frequency cannot be retrieved without setting
the X86_FEATURE_TSC_KNOWN_FREQ flag. This ensures that the refined TSC
calibration is invoked.
[ tglx: Steam-blastered changelog. Sigh ]
Fixes: 4ca4df0b7e ("x86/tsc: Mark TSC frequency determined by CPUID as known")
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: Bin Gao <bin.gao@intel.com>
Link: https://lkml.kernel.org/r/0fe2503aa7d7fc69137141fc705541a78101d2b9.1513920414.git.len.brown@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit beacd6f7ed upstream.
SEGV_PKUERR is a signal specific si_code which happens to have the same
numeric value as several others: BUS_MCEERR_AR, ILL_ILLTRP, FPE_FLTOVF,
TRAP_HWBKPT, CLD_TRAPPED, POLL_ERR, SEGV_THREAD_ID, as such it is not safe
to just test the si_code the signal number must also be tested to prevent a
false positive in fill_sig_info_pkey.
This error was by inspection, and BUS_MCEERR_AR appears to be a real
candidate for confusion. So pass in si_signo and check for SIG_SEGV to
verify that it is actually a SEGV_PKUERR
Fixes: 019132ff3d ("x86/mm/pkeys: Fill in pkey field in siginfo")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20180112203135.4669-2-ebiederm@xmission.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c995efd5a7 upstream.
On context switch from a shallow call stack to a deeper one, as the CPU
does 'ret' up the deeper side it may encounter RSB entries (predictions for
where the 'ret' goes to) which were populated in userspace.
This is problematic if neither SMEP nor KPTI (the latter of which marks
userspace pages as NX for the kernel) are active, as malicious code in
userspace may then be executed speculatively.
Overwrite the CPU's return prediction stack with calls which are predicted
to return to an infinite loop, to "capture" speculation if this
happens. This is required both for retpoline, and also in conjunction with
IBRS for !SMEP && !KPTI.
On Skylake+ the problem is slightly different, and an *underflow* of the
RSB may cause errant branch predictions to occur. So there it's not so much
overwrite, as *filling* the RSB to attempt to prevent it getting
empty. This is only a partial solution for Skylake+ since there are many
other conditions which may result in the RSB becoming empty. The full
solution on Skylake+ is to use IBRS, which will prevent the problem even
when the RSB becomes empty. With IBRS, the RSB-stuffing will not be
required on context switch.
[ tglx: Added missing vendor check and slighty massaged comments and
changelog ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515779365-9032-1-git-send-email-dwmw@amazon.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0d39e2669d upstream.
Currently KASAN doesn't panic in case it don't have enough memory
to boot. Instead, it crashes in some random place:
kernel BUG at arch/x86/mm/physaddr.c:27!
RIP: 0010:__phys_addr+0x268/0x276
Call Trace:
kasan_populate_shadow+0x3f2/0x497
kasan_init+0x12e/0x2b2
setup_arch+0x2825/0x2a2c
start_kernel+0xc8/0x15f4
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x72/0x75
secondary_startup_64+0xa5/0xb0
Use memblock_virt_alloc_try_nid() for allocations without failure
fallback. It will panic with an out of memory message.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Cc: Alexander Potapenko <glider@google.com>
Cc: lkp@01.org
Link: https://lkml.kernel.org/r/20180110153602.18919-1-aryabinin@virtuozzo.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6e032b350c upstream.
New device-tree properties are available which tell the hypervisor
settings related to the RFI flush. Use them to determine the
appropriate flush instruction to use, and whether the flush is
required.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8989d56878 upstream.
A new hypervisor call is available which tells the guest settings
related to the RFI flush. Use it to query the appropriate flush
instruction(s), and whether the flush is required.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc9c9304a4 upstream.
Because there may be some performance overhead of the RFI flush, add
kernel command line options to disable it.
We add a sensibly named 'no_rfi_flush' option, but we also hijack the
x86 option 'nopti'. The RFI flush is not the same as KPTI, but if we
see 'nopti' we can guess that the user is trying to avoid any overhead
of Meltdown mitigations, and it means we don't have to educate every
one about a different command line option.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aa8a5e0062 upstream.
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c7305645eb upstream.
In the SLB miss handler we may be returning to user or kernel. We need
to add a check early on and save the result in the cr4 register, and
then we bifurcate the return path based on that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a08f828cf4 upstream.
Similar to the syscall return path, in fast_exception_return we may be
returning to user or kernel context. We already have a test for that,
because we conditionally restore r13. So use that existing test and
branch, and bifurcate the return based on that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b8e90cb7bc upstream.
In the syscall exit path we may be returning to user or kernel
context. We already have a test for that, because we conditionally
restore r13. So use that existing test and branch, and bifurcate the
return based on that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 222f20f140 upstream.
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 50e51c13b3 upstream.
The rfid/hrfid ((Hypervisor) Return From Interrupt) instruction is
used for switching from the kernel to userspace, and from the
hypervisor to the guest kernel. However it can and is also used for
other transitions, eg. from real mode kernel code to virtual mode
kernel code, and it's not always clear from the code what the
destination context is.
To make it clearer when reading the code, add macros which encode the
expected destination context.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 191eccb158 upstream.
A new hypervisor call has been defined to communicate various
characteristics of the CPU to guests. Add definitions for the hcall
number, flags and a wrapper function.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f10ee3dcc9 upstream.
The switch to the user space page tables in the low level ASM code sets
unconditionally bit 12 and bit 11 of CR3. Bit 12 is switching the base
address of the page directory to the user part, bit 11 is switching the
PCID to the PCID associated with the user page tables.
This fails on a machine which lacks PCID support because bit 11 is set in
CR3. Bit 11 is reserved when PCID is inactive.
While the Intel SDM claims that the reserved bits are ignored when PCID is
disabled, the AMD APM states that they should be cleared.
This went unnoticed as the AMD APM was not checked when the code was
developed and reviewed and test systems with Intel CPUs never failed to
boot. The report is against a Centos 6 host where the guest fails to boot,
so it's not yet clear whether this is a virt issue or can happen on real
hardware too, but thats irrelevant as the AMD APM clearly ask for clearing
the reserved bits.
Make sure that on non PCID machines bit 11 is not set by the page table
switching code.
Andy suggested to rename the related bits and masks so they are clearly
describing what they should be used for, which is done as well for clarity.
That split could have been done with alternatives but the macro hell is
horrible and ugly. This can be done on top if someone cares to remove the
extra orq. For now it's a straight forward fix.
Fixes: 6fd166aae7 ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801140009150.2371@nanos
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 76b043848f upstream.
Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.
This adds X86_FEATURE_RETPOLINE and sets it by default on all CPUs. In
some circumstances, IBRS microcode features may be used instead, and the
retpoline can be disabled.
On AMD CPUs if lfence is serialising, the retpoline can be dramatically
simplified to a simple "lfence; jmp *\reg". A future patch, after it has
been verified that lfence really is serialising in all circumstances, can
enable this by setting the X86_FEATURE_RETPOLINE_AMD feature bit in addition
to X86_FEATURE_RETPOLINE.
Do not align the retpoline in the altinstr section, because there is no
guarantee that it stays aligned when it's copied over the oldinstr during
alternative patching.
[ Andi Kleen: Rename the macros, add CONFIG_RETPOLINE option, export thunks]
[ tglx: Put actual function CALL/JMP in front of the macros, convert to
symbolic labels ]
[ dwmw2: Convert back to numeric labels, merge objtool fixes ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515707194-20531-4-git-send-email-dwmw@amazon.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8d56eff266 upstream.
The following code contains dead logic:
162 if (pgd_none(*pgd)) {
163 unsigned long new_p4d_page = __get_free_page(gfp);
164 if (!new_p4d_page)
165 return NULL;
166
167 if (pgd_none(*pgd)) {
168 set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
169 new_p4d_page = 0;
170 }
171 if (new_p4d_page)
172 free_page(new_p4d_page);
173 }
There can't be any difference between two pgd_none(*pgd) at L162 and L167,
so it's always false at L171.
Dave Hansen explained:
Yes, the double-test was part of an optimization where we attempted to
avoid using a global spinlock in the fork() path. We would check for
unallocated mid-level page tables without the lock. The lock was only
taken when we needed to *make* an entry to avoid collisions.
Now that it is all single-threaded, there is no chance of a collision,
no need for a lock, and no need for the re-check.
As all these functions are only called during init, mark them __init as
well.
Fixes: 03f4424f34 ("x86/mm/pti: Add functions to clone kernel PMDs")
Signed-off-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Jiri Koshina <jikos@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kees Cook <keescook@google.com>
Cc: Andi Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg KH <gregkh@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180108160341.3461-1-albcamus@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ecba8297aa upstream.
The KVM_PPC_ALLOCATE_HTAB ioctl(), implemented by kvmppc_alloc_reset_hpt()
is supposed to completely clear and reset a guest's Hashed Page Table (HPT)
allocating or re-allocating it if necessary.
In the case where an HPT of the right size already exists and it just
zeroes it, it forces a TLB flush on all guest CPUs, to remove any stale TLB
entries loaded from the old HPT.
However, that situation can arise when the HPT is resizing as well - or
even when switching from an RPT to HPT - so those cases need a TLB flush as
well.
So, move the TLB flush to trigger in all cases except for errors.
Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4ed11aeefd upstream.
When serving multiple resize requests following could happen:
CPU0 CPU1
---- ----
kvm_vm_ioctl_resize_hpt_prepare(1);
-> schedule_work()
/* system_rq might be busy: delay */
kvm_vm_ioctl_resize_hpt_prepare(2);
mutex_lock();
if (resize) {
...
release_hpt_resize();
}
... resize_hpt_prepare_work()
-> schedule_work() {
mutex_unlock() /* resize->kvm could be wrong */
struct kvm *kvm = resize->kvm;
mutex_lock(&kvm->lock); <<<< UAF
...
}
i.e. a second resize request with different order could be started by
kvm_vm_ioctl_resize_hpt_prepare(), causing the previous request to be
free()d when there's still an active worker thread which will try to
access it. This leads to a use after free in point marked with UAF on
the diagram above.
To prevent this from happening, instead of unconditionally releasing a
pre-existing resize structure from the prepare ioctl(), we check if
the existing structure has an in-progress worker. We do that by
checking if the resize->error == -EBUSY, which is safe because the
resize->error field is protected by the kvm->lock. If there is an
active worker, instead of releasing, we mark the structure as stale by
unlinking it from kvm_struct.
In the worker thread we check for a stale structure (with kvm->lock
held), and in that case abort, releasing the stale structure ourself.
We make the check both before and the actual allocation. Strictly,
only the check afterwards is needed, the check before is an
optimization: if the structure happens to become stale before the
worker thread is dispatched, rather than during the allocation, it
means we can avoid allocating then immediately freeing a potentially
substantial amount of memory.
This fixes following or similar host kernel crash message:
[ 635.277361] Unable to handle kernel paging request for data at address 0x00000000
[ 635.277438] Faulting instruction address: 0xc00000000052f568
[ 635.277446] Oops: Kernel access of bad area, sig: 11 [#1]
[ 635.277451] SMP NR_CPUS=2048 NUMA PowerNV
[ 635.277470] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nfsv3 nfs_acl nfs
lockd grace fscache kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ext4 ib_srp scsi_transport_srp
ib_ipoib mbcache jbd2 rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma(T)
ib_core ses enclosure scsi_transport_sas sg shpchp leds_powernv ibmpowernv i2c_opal
i2c_core powernv_rng ipmi_powernv ipmi_devintf ipmi_msghandler ip_tables xfs
libcrc32c sr_mod sd_mod cdrom lpfc nvme_fc(T) nvme_fabrics nvme_core ipr nvmet_fc(T)
tg3 nvmet libata be2net crc_t10dif crct10dif_generic scsi_transport_fc ptp scsi_tgt
pps_core crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[ 635.278687] CPU: 40 PID: 749 Comm: kworker/40:1 Tainted: G
------------ T 3.10.0.bz1510771+ #1
[ 635.278782] Workqueue: events resize_hpt_prepare_work [kvm_hv]
[ 635.278851] task: c0000007e6840000 ti: c0000007e9180000 task.ti: c0000007e9180000
[ 635.278919] NIP: c00000000052f568 LR: c0000000009ea310 CTR: c0000000009ea4f0
[ 635.278988] REGS: c0000007e91837f0 TRAP: 0300 Tainted: G
------------ T (3.10.0.bz1510771+)
[ 635.279077] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002022 XER:
00000000
[ 635.279248] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: c0000000009ea310 c0000007e9183a70 c000000001250b00 c0000007e9183b10
GPR04: 0000000000000000 0000000000000000 c0000007e9183650 0000000000000000
GPR08: c0000007ffff7b80 00000000ffffffff 0000000080000028 d00000000d2529a0
GPR12: 0000000000002200 c000000007b56800 c000000000120028 c0000007f135bb40
GPR16: 0000000000000000 c000000005c1e018 c000000005c1e018 0000000000000000
GPR20: 0000000000000001 c0000000011bf778 0000000000000001 fffffffffffffef7
GPR24: 0000000000000000 c000000f1e262e50 0000000000000002 c0000007e9180000
GPR28: c000000f1e262e4c c000000f1e262e50 0000000000000000 c0000007e9183b10
[ 635.280149] NIP [c00000000052f568] __list_add+0x38/0x110
[ 635.280197] LR [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[ 635.280253] Call Trace:
[ 635.280277] [c0000007e9183af0] [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[ 635.280356] [c0000007e9183b70] [c0000000009ea554] mutex_lock+0x64/0x70
[ 635.280426] [c0000007e9183ba0] [d00000000d24da04]
resize_hpt_prepare_work+0xe4/0x1c0 [kvm_hv]
[ 635.280507] [c0000007e9183c40] [c000000000113c0c] process_one_work+0x1dc/0x680
[ 635.280587] [c0000007e9183ce0] [c000000000114250] worker_thread+0x1a0/0x520
[ 635.280655] [c0000007e9183d80] [c00000000012010c] kthread+0xec/0x100
[ 635.280724] [c0000007e9183e30] [c00000000000a4b8] ret_from_kernel_thread+0x5c/0xa4
[ 635.280814] Instruction dump:
[ 635.280880] 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78 fbe1fff8 7c9e2378 7c7f1b78
f8010010
[ 635.281099] f821ff81 e8a50008 7fa52040 40de00b8 <e8be0000> 7fbd2840 40de008c
7fbff040
[ 635.281324] ---[ end trace b628b73449719b9d ]---
Fixes: b5baa68773 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation")
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Replaced BUG_ON()s with WARN_ONs() and reworded commit message
for clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>