IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Recent Intel CPUs like Haswell and IvyBridge have a new
alternative MSR range for perfctrs that allows writing the full
counter width. Enable this range if the hardware reports it
using a new capability bit.
Currently the perf code queries CPUID to get the counter width,
and sign extends the counter values as needed. The traditional
PERFCTR MSRs always limit to 32bit, even though the counter
internally is larger (usually 48 bits on recent CPUs)
When the new capability is set use the alternative range which
do not have these restrictions.
This lowers the overhead of perf stat slightly because it has to
do less interrupts to accumulate the counter value. On Haswell
it also avoids some problems with TSX aborting when the end of
the counter range is reached.
( See the patch "perf/x86/intel: Avoid checkpointed counters
causing excessive TSX aborts" for more details. )
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1372173153-20215-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The control registers are unsigned long (32 bits on i386, 64 bits on
x86-64), and so make that manifest in the data type for the various
constants. Add defines with a _BIT suffix which defines the bit
number, as opposed to the bit mask.
This should resolve some issues with ~bitmask that Linus discovered.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-cwckhbrib2aux1qbteaebij0@git.kernel.org
Bit 1 in the x86 EFLAGS is always set. Name the macro something that
actually tries to explain what it is all about, rather than being a
tautology.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/n/tip-f10rx5vjjm6tfnt8o1wseb3v@git.kernel.org
As load_current_idt() is now what is used to update the IDT for the
switches needed for NMI, lockdep debug, and for tracing, it must not
call local_irq_save(). This is because one of the users of this is
lockdep, which does tracing of local_irq_save() and when the debug
trap is hit, we need to update the IDT before tracing interrupts
being disabled. As load_current_idt() is used to do this, calling
local_irq_save() which lockdep traces, defeats the point of calling
load_current_idt().
As interrupts are already disabled when used by lockdep and NMI, the
only other user is tracing that can disable interrupts itself. Simply
have the tracing update disable interrupts before calling load_current_idt()
instead of breaking the other users.
Here's the dump that happened:
------------[ cut here ]------------
WARNING: at /work/autotest/nobackup/linux-test.git/kernel/fork.c:1196 copy_process+0x2c3/0x1398()
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled)
Modules linked in:
CPU: 1 PID: 4570 Comm: gdm-simple-gree Not tainted 3.10.0-rc3-test+ #5
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
ffffffff81d2a7a5 ffff88006ed13d50 ffffffff8192822b ffff88006ed13d90
ffffffff81035f25 ffff8800721c6000 ffff88006ed13da0 0000000001200011
0000000000000000 ffff88006ed5e000 ffff8800721c6000 ffff88006ed13df0
Call Trace:
[<ffffffff8192822b>] dump_stack+0x19/0x1b
[<ffffffff81035f25>] warn_slowpath_common+0x67/0x80
[<ffffffff81035fe1>] warn_slowpath_fmt+0x46/0x48
[<ffffffff812bfc5d>] ? __raw_spin_lock_init+0x31/0x52
[<ffffffff810341f7>] copy_process+0x2c3/0x1398
[<ffffffff8103539d>] do_fork+0xa8/0x260
[<ffffffff810ca7b1>] ? trace_preempt_on+0x2a/0x2f
[<ffffffff812afb3e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<ffffffff810355cf>] SyS_clone+0x16/0x18
[<ffffffff81938369>] stub_clone+0x69/0x90
[<ffffffff81937fc2>] ? system_call_fastpath+0x16/0x1b
---[ end trace 8b157a9d20ca1aa2 ]---
in fork.c:
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); <-- bug here
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit cf1521a1a5e21fd1e79a458605c4282fbfbbeee2.
Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 8-way twofish/AVX
implementation is therefore faster and this implementation should be removed.
Converting this implementation to use the same method as in twofish/AVX for
table look-ups would give additional ~3% speed up vs twofish/AVX, but would
hardly be worth of the added code and binary size.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commit 604880107010a1e5794552d184cd5471ea31b973.
Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 4-way blowfish
implementation is therefore faster and this implementation should be removed.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
[Purpose of this patch]
As Vaibhav explained in the thread below, tracepoints for irq vectors
are useful.
http://www.spinics.net/lists/mm-commits/msg85707.html
<snip>
The current interrupt traces from irq_handler_entry and irq_handler_exit
provide when an interrupt is handled. They provide good data about when
the system has switched to kernel space and how it affects the currently
running processes.
There are some IRQ vectors which trigger the system into kernel space,
which are not handled in generic IRQ handlers. Tracing such events gives
us the information about IRQ interaction with other system events.
The trace also tells where the system is spending its time. We want to
know which cores are handling interrupts and how they are affecting other
processes in the system. Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
<snip>
On the other hand, my usecase is tracing just local timer event and
getting a value of instruction pointer.
I suggested to add an argument local timer event to get instruction pointer before.
But there is another way to get it with external module like systemtap.
So, I don't need to add any argument to irq vector tracepoints now.
[Patch Description]
Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events.
But there is an above use case to trace specific irq_vector rather than tracing all events.
In this case, we are concerned about overhead due to unwanted events.
So, add following tracepoints instead of introducing irq_vector_entry/exit.
so that we can enable them independently.
- local_timer_vector
- reschedule_vector
- call_function_vector
- call_function_single_vector
- irq_work_entry_vector
- error_apic_vector
- thermal_apic_vector
- threshold_apic_vector
- spurious_apic_vector
- x86_platform_ipi_vector
Also, introduce a logic switching IDT at enabling/disabling time so that a time penalty
makes a zero when tracepoints are disabled. Detailed explanations are as follows.
- Create trace irq handlers with entering_irq()/exiting_irq().
- Create a new IDT, trace_idt_table, at boot time by adding a logic to
_set_gate(). It is just a copy of original idt table.
- Register the new handlers for tracpoints to the new IDT by introducing
macros to alloc_intr_gate() called at registering time of irq_vector handlers.
- Add checking, whether irq vector tracing is on/off, into load_current_idt().
This has to be done below debug checking for these reasons.
- Switching to debug IDT may be kicked while tracing is enabled.
- On the other hands, switching to trace IDT is kicked only when debugging
is disabled.
In addition, the new IDT is created only when CONFIG_TRACING is enabled to avoid being
used for other purposes.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C323ED.5050708@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Rename variables for debugging to describe meaning of them precisely.
Also, introduce a generic way to switch IDT by checking a current state,
debug on/off.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C323A8.7050905@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
When implementing tracepoints in interrupt handers, if the tracepoints are
simply added in the performance sensitive path of interrupt handers,
it may cause potential performance problem due to the time penalty.
To solve the problem, an idea is to prepare non-trace/trace irq handers and
switch their IDTs at the enabling/disabling time.
So, let's introduce entering_irq()/exiting_irq() for pre/post-
processing of each irq handler.
A way to use them is as follows.
Non-trace irq handler:
smp_irq_handler()
{
entering_irq(); /* pre-processing of this handler */
__smp_irq_handler(); /*
* common logic between non-trace and trace handlers
* in a vector.
*/
exiting_irq(); /* post-processing of this handler */
}
Trace irq_handler:
smp_trace_irq_handler()
{
entering_irq(); /* pre-processing of this handler */
trace_irq_entry(); /* tracepoint for irq entry */
__smp_irq_handler(); /*
* common logic between non-trace and trace handlers
* in a vector.
*/
trace_irq_exit(); /* tracepoint for irq exit */
exiting_irq(); /* post-processing of this handler */
}
If tracepoints can place outside entering_irq()/exiting_irq() as follows,
it looks cleaner.
smp_trace_irq_handler()
{
trace_irq_entry();
smp_irq_handler();
trace_irq_exit();
}
But it doesn't work.
The problem is with irq_enter/exit() being called. They must be called before
trace_irq_enter/exit(), because of the rcu_irq_enter() must be called before
any tracepoints are used, as tracepoints use rcu to synchronize.
As a possible alternative, we may be able to call irq_enter() first as follows
if irq_enter() can nest.
smp_trace_irq_hander()
{
irq_entry();
trace_irq_entry();
smp_irq_handler();
trace_irq_exit();
irq_exit();
}
But it doesn't work, either.
If irq_enter() is nested, it may have a time penalty because it has to check if it
was already called or not. The time penalty is not desired in performance sensitive
paths even if it is tiny.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C3238D.9040706@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJRvOHmAAoJEHm+PkMAQRiGpOYIAI/OMDjCMro8S1FjDPGr+wsz
r/u4h4DiHQTdAUSBYhksgXVBSm02hGhDF3tz7u9oAXuv+4zlGUMZNjmEmLOFfdyd
Ve9vqMrUkdwA0jt0d7AYVjtCy/j/yVe5szXZCksHXOZiyb78SEZPQgHc13CJ4lKI
K2jPk0sMko7d5ptALPIX+ome48M+3w3k1HuQ62Znm4pFSADcFGpGdf40HP/5LL+I
Llp3XHJ5OZrcHRal0t39oHJOCWW61bnYXTWel9J7XoUztebF2cRgFydAhtrto42z
x7ELOKVY197WH8l56cnVHrISneQd4kgWrooxLYy6QsJl73/qvQBX+7nmkoes7so=
=qeV0
-----END PGP SIGNATURE-----
Merge tag 'v3.10-rc6' into x86/cleanups
Linux 3.10-rc6
We need a change that is the mainline tree for further work.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The call stack below shows how this happens: basically eager_fpu_init()
calls __thread_fpu_begin(current) which then does if (!use_eager_fpu()),
which, in turn, uses static_cpu_has.
And we're executing before alternatives so static_cpu_has doesn't work
there yet.
Use the safe variant in this path which becomes optimal after
alternatives have run.
WARNING: at arch/x86/kernel/cpu/common.c:1368 warn_pre_alternatives+0x1e/0x20()
You're using static_cpu_has before alternatives have run!
Modules linked in:
Pid: 0, comm: swapper Not tainted 3.9.0-rc8+ #1
Call Trace:
warn_slowpath_common
warn_slowpath_fmt
? fpu_finit
warn_pre_alternatives
eager_fpu_init
fpu_init
cpu_init
trap_init
start_kernel
? repair_env_string
x86_64_start_reservations
x86_64_start_kernel
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1370772454-6106-6-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We want to use this in early code where alternatives might not have run
yet and for that case we fall back to the dynamic boot_cpu_has.
For that, force a 5-byte jump since the compiler could be generating
differently sized jumps for each label.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1370772454-6106-5-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
static_cpu_has may be used only after alternatives have run. Before that
it always returns false if constant folding with __builtin_constant_p()
doesn't happen. And you don't want that.
This patch is the result of me debugging an issue where I overzealously
put static_cpu_has in code which executed before alternatives have run
and had to spend some time with scratching head and cursing at the
monitor.
So add a jump to a warning which screams loudly when we use this
function too early. The alternatives patch that check away in
conjunction with patching the rest of the kernel image.
[ hpa: factored this into its own configuration option. If we want to
have an overarching option, it should be an option which selects
other options, not as a group option in the source code. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1370772454-6106-4-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The following change fixes the x86 implementation of
trigger_all_cpu_backtrace(), which was previously (accidentally,
as far as I can tell) disabled to always return false as on
architectures that do not implement this function.
trigger_all_cpu_backtrace(), as defined in include/linux/nmi.h,
should call arch_trigger_all_cpu_backtrace() if available, or
return false if the underlying arch doesn't implement this
function.
x86 did provide a suitable arch_trigger_all_cpu_backtrace()
implementation, but it wasn't actually being used because it was
declared in asm/nmi.h, which linux/nmi.h doesn't include. Also,
linux/nmi.h couldn't easily be fixed by including asm/nmi.h,
because that file is not available on all architectures.
I am proposing to fix this by moving the x86 definition of
arch_trigger_all_cpu_backtrace() to asm/irq.h.
Tested via: echo l > /proc/sysrq-trigger
Before the change, this uses a fallback implementation which
shows backtraces on active CPUs (using
smp_call_function_interrupt() )
After the change, this shows NMI backtraces on all CPUs
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1370518875-1346-1-git-send-email-walken@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are in the process of removing all the __cpuinit annotations.
While working on making that change, an existing problem was
made evident:
WARNING: arch/x86/kernel/built-in.o(.text+0x198f2): Section mismatch
in reference from the function cpu_init() to the function
.init.text:load_ucode_ap() The function cpu_init() references
the function __init load_ucode_ap(). This is often because cpu_init
lacks a __init annotation or the annotation of load_ucode_ap is wrong.
This now appears because in my working tree, cpu_init() is no longer
tagged as __cpuinit, and so the audit picks up the mismatch. The 2nd
hypothesis from the audit is the correct one, as there was an incorrect
__init tag on the prototype in the header (but __cpuinit was used on
the function itself.)
The audit is telling us that the prototype's __init annotation took
effect and the function did land in the .init.text section. Checking
with objdump on a mainline tree that still has __cpuinit shows that
the __cpuinit on the function takes precedence over the __init on the
prototype, but that won't be true once we make __cpuinit a no-op.
Even though we are removing __cpuinit, we temporarily align both
the function and the prototype on __cpuinit so that the changeset
can be applied to stable trees if desired.
[ hpa: build fix only, no object code change ]
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable <stable@vger.kernel.org> # 3.9+
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1371654926-11729-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Which by default will be x86_acpi_suspend_lowlevel.
This registration allows us to register another callback
if there is a need to use another platform specific callback.
Signed-off-by: Liang Tang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Ben Guthro <benjamin.guthro@citrix.com>
Acked-by: "H. Peter Anvin" <hpa@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Similar to SandyBridge, but has a few new events and two
new counter bits.
There are some new counter flags that need to be prevented
from being set on fixed counters, and allowed to be set
for generic counters.
Also we add support for the counter 2 constraint to handle
all raw events.
(Contains fixes from Stephane Eranian.)
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andi Kleen <ak@linux.jf.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1371515812-9646-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
few users were reporting boot regressions in v3.9. This has now been
fixed with a more accurate "minimum storage requirement to avoid
bricking" value from Samsung (5K instead of 50%) and code to trigger
garbage collection when we near our limit - Matthew Garrett.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRtkY2AAoJEC84WcCNIz1VJOsP/00xwiY4VKh2RfqNkYKSl/w5
gEshIHFEAXHX5X8C4ReocZVywvdjTgbJoKBbBy3FePYRzLddrmavvjen17hk7BzS
/cO8/eXForkNWCGR1kLagA6HLpgKP5DPayKizoMb4Mg6muzfT1SCcN6Pzh8cDMWe
btcq/l9JZejXdJ4Wfoq1My+WdXs19OT/BNeD3y65K4x29vNUjop6oaIdDJWLlH/S
aeLHh8d4xbSHNWzK1fBP7CnFTYU27xxs1BFNAReU6McxeQCYZAIaRovYnjTZEvfJ
twd2tLrOn9HBVTbWa8T4XGNSr+QcT4XGMadLvdwuqltmKDfH6Onm8aWQM3IqA7gy
Qimbcv2B7HrITgXWTzp3DPkXF1LA8/8QHSBXVMUU9Rl6QOLy18vIdKiQy3M1Ng9Z
0q+Ow93JtnL11zf9wLDMdKaKcA9HOxbG/wRTK6XO4vGaWj9brFv3n5Ib7OreHH6D
GP58zDEnThFuj97K/NKREBZZFcFOMZpKk5MAipVkzltihUQmNeTF/dAtBJ3Ncu/A
PqQE6uuKVXjASJR8Gy0bI3WHtSTZK4L/sg9c2MF3bdJa9BswN+m8IEbls+S+iFOx
+sYPQx7Zw6SFENxDw8cDYNzC14yfr60qyOxTWfkHH7l/FnvhOgwHzqPsLcXx0ouR
C6k1yPYSTgiqFdWC2sjn
=TZuM
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent' into x86/urgent
* More tweaking to the EFI variable anti-bricking algorithm. Quite a
few users were reporting boot regressions in v3.9. This has now been
fixed with a more accurate "minimum storage requirement to avoid
bricking" value from Samsung (5K instead of 50%) and code to trigger
garbage collection when we near our limit - Matthew Garrett.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Added callback registration for package threshold reports. Also added
a callback to check the rate control implemented in callback or not.
If there is no rate control implemented, then there is a default rate
control similar to core threshold notification by delaying for
CHECK_INTERVAL (5 minutes) between reports.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
... to void * like the boot services and lose all the void * casts. No
functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
This patch reworks the UEFI anti-bricking code, including an effective
reversion of cc5a080c and 31ff2f20. It turns out that calling
QueryVariableInfo() from boot services results in some firmware
implementations jumping to physical addresses even after entering virtual
mode, so until we have 1:1 mappings for UEFI runtime space this isn't
going to work so well.
Reverting these gets us back to the situation where we'd refuse to create
variables on some systems because they classify deleted variables as "used"
until the firmware triggers a garbage collection run, which they won't do
until they reach a lower threshold. This results in it being impossible to
install a bootloader, which is unhelpful.
Feedback from Samsung indicates that the firmware doesn't need more than
5KB of storage space for its own purposes, so that seems like a reasonable
threshold. However, there's still no guarantee that a platform will attempt
garbage collection merely because it drops below this threshold. It seems
that this is often only triggered if an attempt to write generates a
genuine EFI_OUT_OF_RESOURCES error. We can force that by attempting to
create a variable larger than the remaining space. This should fail, but if
it somehow succeeds we can then immediately delete it.
I've tested this on the UEFI machines I have available, but I don't have
a Samsung and so can't verify that it avoids the bricking problem.
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Lee, Chun-Y <jlee@suse.com> [ dummy variable cleanup ]
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Reimplement FPU detection code in C and drop old, not-so-recommended
detection method in asm. Move all the relevant stuff into i387.c where
it conceptually belongs. Finally drop cpuinfo_x86.hard_math.
[ hpa: huge thanks to Borislav for taking my original concept patch
and productizing it ]
[ Boris, note to self: do not use static_cpu_has before alternatives! ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1367244262-29511-2-git-send-email-bp@alien8.de
Link: http://lkml.kernel.org/r/1365436666-9837-2-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As Marcelo pointed out that
| "(retention of large number of pages while zapping)
| can be fatal, it can lead to OOM and host crash"
We introduce a list, kvm->arch.zapped_obsolete_pages, to link all
the pages which are deleted from the mmu cache but not actually
freed. When page reclaiming is needed, we always zap this kind of
pages first.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
walk and zap all shadow pages one by one, also it need to zap all guest
page's rmap and all shadow page's parent spte list. Particularly, things
become worse if guest uses more memory or vcpus. It is not good for
scalability
In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number which is stored in
kvm->arch.mmu_valid_gen and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created
When KVM need zap all shadow pages sptes, it just simply increase the
global generation-number then reload root shadow pages on all vcpus.
Vcpu will create a new shadow page table according to current kvm's
generation-number. It ensures the old pages are not used any more.
Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped by using lock-break technique
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Remove the extra tab in __flush_tlb_one().
CC: Alex Shi <alex.shi@intel.com>
CC: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/51AD8902.60603@linux.vnet.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Fix section mismatch warnings on microcode_amd_early.
Compile error occurs when CONFIG_MICROCODE=m, change so that early
loading depends on microcode_core.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/20130531150241.GA12006@jshin-Toonie
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It being static causes over a dozen instances to be scattered
across the kernel image, with non of them ever being referenced
in any way. Making the variable extern without ever defining it
works as well - all we need is to have the compiler think the
variable is being accessed.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/51A610B802000078000D99A0@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The vsyscall related pvclock entries can only ever be used on
x86-64, and hence they shouldn't even get allocated for 32-bit
kernels (the more that it is there where address space is
relatively precious).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/51A60F1F02000078000D997C@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation work for early loading, refactor some common functions
that will be shared, and move some struct defines to a common header file.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1369940959-2077-4-git-send-email-jacob.shin@amd.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Currently save_microcode_in_initrd() is declared in vendor neutural
microcode.h file, but defined in vendor specific
microcode_intel_early.c file. Vendor abstract it out to
microcode_core_early.c with a wrapper function.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1369940959-2077-3-git-send-email-jacob.shin@amd.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Several drivers currently use mtrr_add through various #ifdef guards
and/or drm wrappers. The vast majority of them want to add WC MTRRs
on x86 systems and don't actually need the MTRR if PAT (i.e.
ioremap_wc, etc) are working.
arch_phys_wc_add and arch_phys_wc_del are new functions, available
on all architectures and configurations, that add WC MTRRs on x86 if
needed (and handle errors) and do nothing at all otherwise. They're
also easier to use than mtrr_add and mtrr_del, so the call sites can
be simplified.
As an added benefit, this will avoid wasting MTRRs and possibly
warning pointlessly on PAT-supporting systems.
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
binutils prior to 2.18 (e.g. the ones found on SLE10) don't support
assembling PEXTRD, so a macro based approach like the one for PCLMULQDQ
in the same file should be used.
This requires making the helper macros capable of recognizing 32-bit
general purpose register operands.
[ hpa: tagging for stable as it is a low risk build fix ]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/51A6142A02000078000D99D8@nat28.tlf.novell.com
Cc: Alexander Boyko <alexander_boyko@xyratex.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The patch "x86: Increase precision of x86_platform.get/set_wallclock"
changed the x86 platform set_wallclock/get_wallclock interfaces to
use nsec granular timespecs instead of a second granular interface.
However, that patch missed converting the vrtc code, so this patch
converts those functions to use timespecs.
Many thanks to the kbuild test robot for finding this!
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
All the virtualized platforms (KVM, lguest and Xen) have persistent
wallclocks that have more than one second of precision.
read_persistent_wallclock() and update_persistent_wallclock() allow
for nanosecond precision but their implementation on x86 with
x86_platform.get/set_wallclock() only allows for one second precision.
This means guests may see a wallclock time that is off by up to 1
second.
Make set_wallclock() and get_wallclock() take a struct timespec
parameter (which allows for nanosecond precision) so KVM and Xen
guests may start with a more accurate wallclock time and a Xen dom0
can maintain a more accurate wallclock for guests.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The only reason uaccess routines might sleep
is if they fault. Make this explicit.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1369577426-26721-9-git-send-email-mst@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While porting Vince's perf overflow tests I found perf event
breakpoint overflow does not work properly.
I found the x86 RF EFLAG bit not being set when returning
from debug exception after triggering signal handler. Which
is exactly what you get when you set perf breakpoint overflow
SIGIO handler.
This patch and the next two patches fix the underlying bugs.
This patch adds the RF EFLAGS bit to be restored on return from
signal from the original register context before the signal was
entered.
This will prevent the RF flag to disappear when returning
from exception due to the signal handler being executed.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Originally-Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1367421944-19082-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is sometimes very helpful to be able to pinpoint the location which
causes a double fault before it turns into a triple fault and the
machine reboots. We have this for 32-bit already so extend it to 64-bit.
On 64-bit we get the register snapshot at #DF time and not from the
first exception which actually causes the #DF. It should be close
enough, though.
[ hpa: and definitely better than nothing, which is what we have now. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1368093749-31296-1-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull idle update from Len Brown:
"Add support for new Haswell-ULT CPU idle power states"
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
intel_idle: initial C8, C9, C10 support
tools/power turbostat: display C8, C9, C10 residency
Pull stray syscall bits from Al Viro:
"Several syscall-related commits that were missing from the original"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
unicore32: just use mmap_pgoff()...
unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
Merge rwsem optimizations from Michel Lespinasse:
"These patches extend Alex Shi's work (which added write lock stealing
on the rwsem slow path) in order to provide rwsem write lock stealing
on the fast path (that is, without taking the rwsem's wait_lock).
I have unfortunately been unable to push this through -next before due
to Ingo Molnar / David Howells / Peter Zijlstra being busy with other
things. However, this has gotten some attention from Rik van Riel and
Davidlohr Bueso who both commented that they felt this was ready for
v3.10, and Ingo Molnar has said that he was OK with me pushing
directly to you. So, here goes :)
Davidlohr got the following test results from pgbench running on a
quad-core laptop:
| db_size | clients | tps-vanilla | tps-rwsem |
+---------+----------+----------------+--------------+
| 160 MB | 1 | 5803 | 6906 | + 19.0%
| 160 MB | 2 | 13092 | 15931 |
| 160 MB | 4 | 29412 | 33021 |
| 160 MB | 8 | 32448 | 34626 |
| 160 MB | 16 | 32758 | 33098 |
| 160 MB | 20 | 26940 | 31343 | + 16.3%
| 160 MB | 30 | 25147 | 28961 |
| 160 MB | 40 | 25484 | 26902 |
| 160 MB | 50 | 24528 | 25760 |
------------------------------------------------------
| 1.6 GB | 1 | 5733 | 7729 | + 34.8%
| 1.6 GB | 2 | 9411 | 19009 | + 101.9%
| 1.6 GB | 4 | 31818 | 33185 |
| 1.6 GB | 8 | 33700 | 34550 |
| 1.6 GB | 16 | 32751 | 33079 |
| 1.6 GB | 20 | 30919 | 31494 |
| 1.6 GB | 30 | 28540 | 28535 |
| 1.6 GB | 40 | 26380 | 27054 |
| 1.6 GB | 50 | 25241 | 25591 |
------------------------------------------------------
| 7.6 GB | 1 | 5779 | 6224 |
| 7.6 GB | 2 | 10897 | 13611 | + 24.9%
| 7.6 GB | 4 | 32683 | 33108 |
| 7.6 GB | 8 | 33968 | 34712 |
| 7.6 GB | 16 | 32287 | 32895 |
| 7.6 GB | 20 | 27770 | 31689 | + 14.1%
| 7.6 GB | 30 | 26739 | 29003 |
| 7.6 GB | 40 | 24901 | 26683 |
| 7.6 GB | 50 | 17115 | 25925 | + 51.5%
------------------------------------------------------
(Davidlohr also has one additional patch which further improves
throughput, though I will ask him to send it directly to you as I have
suggested some minor changes)."
* emailed patches from Michel Lespinasse <walken@google.com>:
rwsem: no need for explicit signed longs
x86 rwsem: avoid taking slow path when stealing write lock
rwsem: do not block readers at head of queue if other readers are active
rwsem: implement support for write lock stealing on the fastpath
rwsem: simplify __rwsem_do_wake
rwsem: skip initial trylock in rwsem_down_write_failed
rwsem: avoid taking wait_lock in rwsem_down_write_failed
rwsem: use cmpxchg for trying to steal write lock
rwsem: more agressive lock stealing in rwsem_down_write_failed
rwsem: simplify rwsem_down_write_failed
rwsem: simplify rwsem_down_read_failed
rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
rwsem: shorter spinlocked section in rwsem_down_failed_common()
rwsem: make the waiter type an enumeration rather than a bitmask
modify __down_write[_nested] and __down_write_trylock to grab the write
lock whenever the active count is 0, even if there are queued waiters
(they must be writers pending wakeup, since the active count is 0).
Note that this is an optimization only; architectures without this
optimization will still work fine:
- __down_write() would take the slow path which would take the wait_lock
and then try stealing the lock (as in the spinlocked rwsem implementation)
- __down_write_trylock() would fail, but callers must be ready to deal
with that - since there are some writers pending wakeup, they could
have raced with us and obtained the lock before we steal it.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The updates are mostly about the x86 IOMMUs this time. Exceptions are
the groundwork for the PAMU IOMMU from Freescale (for a PPC platform)
and an extension to the IOMMU group interface. On the x86 side this
includes a workaround for VT-d to disable interrupt remapping on broken
chipsets. On the AMD-Vi side the most important new feature is a kernel
command-line interface to override broken information in IVRS ACPI
tables and get interrupt remapping working this way. Besides that there
are small fixes all over the place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRh2vAAAoJECvwRC2XARrjbjkP/jzKzeffybUpQsIJF8rs/IEt
hSwqpGLr6WR5FdneEH9fiBIp4pyMDXmuAb/2ZNgB+DgPN3xgqmWVo4WLk7pMo3BS
/xIz/lu7hIX3AtKt807pL9+rPdhGYEJ43Vmr4bW9x0l1kuNXy6fmMLcN5FaPKjV4
p4hY4jOstEgtYQw4wi39/9b4FsYoipZizkOUSdtCzWwTv7jOHH7/Wra8iZyzL6Je
1VlF/efp0ytTcwLdHOfGwPCIlZrQRtQCM4SqdAUG9bOL3ARR9Yu/0iW1295nbLzo
CQX5CfKePvo/fGxki1jcBi+UCyxYKPosB5kCxmh4MAxCg/VzzMsaME/A73tLJa6W
Y29bbjwPoBPMq03HX8S9R5QWY8HpFujUUp+J4TXcKuTgYEV28WfLu1uaeKD716nM
LoXUojov7Cj8ZQZnhyu5l+XNaephBZLfw/8bM6bAxhlKXwAjmLiS5Z+srPl1GJee
5GCV+L94JifHLZaREWh3JFsh9O3W7Wno2++c4JU32uCWJHXH7tMgs2P8n5AY9rnT
Km1a9y6w2MF3Gg9j4y6u75m0XnFTNzYjeJMUtqVlwVhNHhgaXfuIWY63xOQCLJs1
ThTHOjoh0VqONGobR/ywn+0ouo9X07DnWpluyFd+zY3XK0UE0NOu9XMr4i6TWxOf
mlzWoEKxtw36XGHB/FtQ
=MVc/
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"The updates are mostly about the x86 IOMMUs this time.
Exceptions are the groundwork for the PAMU IOMMU from Freescale (for a
PPC platform) and an extension to the IOMMU group interface.
On the x86 side this includes a workaround for VT-d to disable
interrupt remapping on broken chipsets. On the AMD-Vi side the most
important new feature is a kernel command-line interface to override
broken information in IVRS ACPI tables and get interrupt remapping
working this way.
Besides that there are small fixes all over the place."
* tag 'iommu-updates-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (24 commits)
iommu/tegra: Fix printk formats for dma_addr_t
iommu: Add a function to find an iommu group by id
iommu/vt-d: Remove warning for HPET scope type
iommu: Move swap_pci_ref function to drivers/iommu/pci.h.
iommu/vt-d: Disable translation if already enabled
iommu/amd: fix error return code in early_amd_iommu_init()
iommu/AMD: Per-thread IOMMU Interrupt Handling
iommu: Include linux/err.h
iommu/amd: Workaround for ERBT1312
iommu/amd: Document ivrs_ioapic and ivrs_hpet parameters
iommu/amd: Don't report firmware bugs with cmd-line ivrs overrides
iommu/amd: Add ioapic and hpet ivrs override
iommu/amd: Add early maps for ioapic and hpet
iommu/amd: Extend IVRS special device data structure
iommu/amd: Move add_special_device() to __init
iommu: Fix compile warnings with forward declarations
iommu/amd: Properly initialize irq-table lock
iommu/amd: Use AMD specific data structure for irq remapping
iommu/amd: Remove map_sg_no_iommu()
iommu/vt-d: add quirk for broken interrupt remapping on 55XX chipsets
...
Pull kvm updates from Gleb Natapov:
"Highlights of the updates are:
general:
- new emulated device API
- legacy device assignment is now optional
- irqfd interface is more generic and can be shared between arches
x86:
- VMCS shadow support and other nested VMX improvements
- APIC virtualization and Posted Interrupt hardware support
- Optimize mmio spte zapping
ppc:
- BookE: in-kernel MPIC emulation with irqfd support
- Book3S: in-kernel XICS emulation (incomplete)
- Book3S: HV: migration fixes
- BookE: more debug support preparation
- BookE: e6500 support
ARM:
- reworking of Hyp idmaps
s390:
- ioeventfd for virtio-ccw
And many other bug fixes, cleanups and improvements"
* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
kvm: Add compat_ioctl for device control API
KVM: x86: Account for failing enable_irq_window for NMI window request
KVM: PPC: Book3S: Add API for in-kernel XICS emulation
kvm/ppc/mpic: fix missing unlock in set_base_addr()
kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
kvm/ppc/mpic: remove users
kvm/ppc/mpic: fix mmio region lists when multiple guests used
kvm/ppc/mpic: remove default routes from documentation
kvm: KVM_CAP_IOMMU only available with device assignment
ARM: KVM: iterate over all CPUs for CPU compatibility check
KVM: ARM: Fix spelling in error message
ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
KVM: ARM: Fix API documentation for ONE_REG encoding
ARM: KVM: promote vfp_host pointer to generic host cpu context
ARM: KVM: add architecture specific hook for capabilities
ARM: KVM: perform HYP initilization for hotplugged CPUs
ARM: KVM: switch to a dual-step HYP init code
ARM: KVM: rework HYP page table freeing
ARM: KVM: enforce maximum size for identity mapped code
ARM: KVM: move to a KVM provided HYP idmap
...