IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* master: (848 commits)
SELinux: Fix RCU deref check warning in sel_netport_insert()
binary_sysctl(): fix memory leak
mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
ipmi_watchdog: restore settings when BMC reset
oom: fix integer overflow of points in oom_badness
memcg: keep root group unchanged if creation fails
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
nilfs2: unbreak compat ioctl
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mmc: vub300: fix type of firmware_rom_wait_states module parameter
Revert "mmc: enable runtime PM by default"
mmc: sdhci: remove "state" argument from sdhci_suspend_host
x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
IB/qib: Correct sense on freectxts increment and decrement
RDMA/cma: Verify private data length
cgroups: fix a css_set not found bug in cgroup_attach_proc
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
...
Conflicts:
kernel/cgroup_freezer.c
Mathieu Desnoyers pointed out a case that can cause issues with
NMIs running on the debug stack:
int3 -> interrupt -> NMI -> int3
Because the interrupt changes the stack, the NMI will not see that
it preempted the debug stack. Looking deeper at this case,
interrupts only happen when the int3 is from userspace or in
an a location in the exception table (fixup).
userspace -> int3 -> interurpt -> NMI -> int3
All other int3s that happen in the kernel should be processed
without ever enabling interrupts, as the do_trap() call will
panic the kernel if it is called to process any other location
within the kernel.
Adding a counter around the sections that enable interrupts while
using the debug stack allows the NMI to also check that case.
If the NMI sees that it either interrupted a task using the debug
stack or the debug counter is non-zero, then it will have to
change the IDT table to make the int3 not change stacks (which will
corrupt the stack if it does).
Note, I had to move the debug_usage functions out of processor.h
and into debugreg.h because of the static inlined functions to
inc and dec the debug_usage counter. __get_cpu_var() requires
smp.h which includes processor.h, and would fail to build.
Link: http://lkml.kernel.org/r/1323976535.23971.112.camel@gandalf.stny.rr.com
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With i386, NMIs and breakpoints use the current stack and they
do not reset the stack pointer to a fix point that might corrupt
a previous NMI or breakpoint (as it does in x86_64). But NMIs are
still not made to be re-entrant, and need to prevent the case that
an NMI hitting a breakpoint (which does an iret), doesn't allow
another NMI to run.
The fix is to let the NMI be in 3 different states:
1) not running
2) executing
3) latched
When no NMI is executing on a given CPU, the state is "not running".
When the first NMI comes in, the state is switched to "executing".
On exit of that NMI, a cmpxchg is performed to switch the state
back to "not running" and if that fails, the NMI is restarted.
If a breakpoint is hit and does an iret, which re-enables NMIs,
and another NMI comes in before the first NMI finished, it will
detect that the state is not in the "not running" state and the
current NMI is nested. In this case, the state is switched to "latched"
to let the interrupted NMI know to restart the NMI handler, and
the nested NMI exits without doing anything.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We want to allow NMI handlers to have breakpoints to be able to
remove stop_machine from ftrace, kprobes and jump_labels. But if
an NMI interrupts a current breakpoint, and then it triggers a
breakpoint itself, it will switch to the breakpoint stack and
corrupt the data on it for the breakpoint processing that it
interrupted.
Instead, have the NMI check if it interrupted breakpoint processing
by checking if the stack that is currently used is a breakpoint
stack. If it is, then load a special IDT that changes the IST
for the debug exception to keep the same stack in kernel context.
When the NMI is done, it puts it back.
This way, if the NMI does trigger a breakpoint, it will keep
using the same stack and not stomp on the breakpoint data for
the breakpoint it interrupted.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In x86, when an NMI goes off, the CPU goes into an NMI context that
prevents other NMIs to trigger on that CPU. If an NMI is suppose to
trigger, it has to wait till the previous NMI leaves NMI context.
At that time, the next NMI can trigger (note, only one more NMI will
trigger, as only one can be latched at a time).
The way x86 gets out of NMI context is by calling iret. The problem
with this is that this causes problems if the NMI handle either
triggers an exception, or a breakpoint. Both the exception and the
breakpoint handlers will finish with an iret. If this happens while
in NMI context, the CPU will leave NMI context and a new NMI may come
in. As NMI handlers are not made to be re-entrant, this can cause
havoc with the system, not to mention, the nested NMI will write
all over the previous NMI's stack.
Linus Torvalds proposed the following workaround to this problem:
https://lkml.org/lkml/2010/7/14/264
"In fact, I wonder if we couldn't just do a software NMI disable
instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like
NMI entry:
- load percpu NMI stack frame pointer
- if non-zero we know we're nested, and should ignore this NMI:
- we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
- before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
- set the NMI stack pointer to the current stack pointer
NMI exit (not the above "immediate exit because we nested"):
clear the percpu NMI stack pointer
Just do the iret.
Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.
And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy."
I first tried to follow this advice but as I started implementing this
code, a few gotchas showed up.
One, is accessing per-cpu variables in the NMI handler.
The problem is that per-cpu variables use the %gs register to get the
variable for the given CPU. But as the NMI may happen in userspace,
we must first perform a SWAPGS to get to it. The NMI handler already
does this later in the code, but its too late as we have saved off
all the registers and we don't want to do that for a disabled NMI.
Peter Zijlstra suggested to keep all variables on the stack. This
simplifies things greatly and it has the added benefit of cache locality.
Two, faulting on the iret.
I really wanted to make this work, but it was becoming very hacky, and
I never got it to be stable. The iret already had a fault handler for
userspace faulting with bad segment registers, and getting NMI to trigger
a fault and detect it was very tricky. But for strange reasons, the system
would usually take a double fault and crash. I never figured out why
and decided to go with a simple "jmp" approach. The new approach I took
also simplified things.
Finally, the last problem with Linus's approach was to have the nested
NMI handler do a ret instead of an iret to give the first NMI NMI-context
again.
The problem is that ret is much more limited than an iret. I couldn't figure
out how to get the stack back where it belonged. I could have copied the
current stack, pushed the return onto it, but my fear here is that there
may be some place that writes data below the stack pointer. I know that
is not something code should depend on, but I don't want to chance it.
I may add this feature later, but for now, an NMI handler that loses NMI
context will not get it back.
Here's what is done:
When an NMI comes in, the HW pushes the interrupt stack frame onto the
per cpu NMI stack that is selected by the IST.
A special location on the NMI stack holds a variable that is set when
the first NMI handler runs. If this variable is set then we know that
this is a nested NMI and we process the nested NMI code.
There is still a race when this variable is cleared and an NMI comes
in just before the first NMI does the return. For this case, if the
variable is cleared, we also check if the interrupted stack is the
NMI stack. If it is, then we process the nested NMI code.
Why the two tests and not just test the interrupted stack?
If the first NMI hits a breakpoint and loses NMI context, and then it
hits another breakpoint and while processing that breakpoint we get a
nested NMI. When processing a breakpoint, the stack changes to the
breakpoint stack. If another NMI comes in here we can't rely on the
interrupted stack to be the NMI stack.
If the variable is not set and the interrupted task's stack is not the
NMI stack, then we know this is the first NMI and we can process things
normally. But in order to do so, we need to do a few things first.
1) Set the stack variable that tells us that we are in an NMI handler
2) Make two copies of the interrupt stack frame.
One copy is used to return on iret
The other is used to restore the first one if we have a nested NMI.
This is what the stack will look like:
+-------------------------+
| original SS |
| original Return RSP |
| original RFLAGS |
| original CS |
| original RIP |
+-------------------------+
| temp storage for rdx |
+-------------------------+
| NMI executing variable |
+-------------------------+
| Saved SS |
| Saved Return RSP |
| Saved RFLAGS |
| Saved CS |
| Saved RIP |
+-------------------------+
| copied SS |
| copied Return RSP |
| copied RFLAGS |
| copied CS |
| copied RIP |
+-------------------------+
| pt_regs |
+-------------------------+
The original stack frame contains what the HW put in when we entered
the NMI.
We store %rdx as a temp variable to use. Both the original HW stack
frame and this %rdx storage will be clobbered by nested NMIs so we
can not rely on them later in the first NMI handler.
The next item is the special stack variable that is set when we execute
the rest of the NMI handler.
Then we have two copies of the interrupt stack. The second copy is
modified by any nested NMIs to let the first NMI know that we triggered
a second NMI (latched) and that we should repeat the NMI handler.
If the first NMI hits an exception or breakpoint that takes it out of
NMI context, if a second NMI comes in before the first one finishes,
it will update the copied interrupt stack to point to a fix up location
to trigger another NMI.
When the first NMI calls iret, it will instead jump to the fix up
location. This fix up location will copy the saved interrupt stack back
to the copy and execute the nmi handler again.
Note, the nested NMI knows enough to check if it preempted a previous
NMI handler while it is in the fixup location. If it has, it will not
modify the copied interrupt stack and will just leave as if nothing
happened. As the NMI handle is about to execute again, there's no reason
to latch now.
To test all this, I forced the NMI handler to call iret and take itself
out of NMI context. I also added assemble code to write to the serial to
make sure that it hits the nested path as well as the fix up path.
Everything seems to be working fine.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Linus cleaned up the NMI handler but it still needs some comments to
explain why it uses save_paranoid but not paranoid_exit. Just to keep
others from adding that in the future, document why it's not used.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.
As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.
There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.
Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add event maps for Intel x86 processors (with architected PMU v2 or later).
On AMD, there is frequency scaling but no Turbo. There is no core
cycle event not subject to frequency scaling, therefore we do not
provide a mapping.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the encoding and definitions necessary for the
unhalted_reference_cycles event avaialble since Intel Core 2 processors.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename
4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before
4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* commit 'v3.2-rc3': (412 commits)
Linux 3.2-rc3
virtio-pci: make reset operation safer
virtio-mmio: Correct the name of the guest features selector
virtio: add HAS_IOMEM dependency to MMIO platform bus driver
eCryptfs: Extend array bounds for all filename chars
eCryptfs: Flush file in vma close
eCryptfs: Prevent file create race condition
regulator: TPS65910: Fix VDD1/2 voltage selector count
i2c: Make i2cdev_notifier_call static
i2c: Delete ANY_I2C_BUS
i2c: Fix device name for 10-bit slave address
i2c-algo-bit: Generate correct i2c address sequence for 10-bit target
drm: integer overflow in drm_mode_dirtyfb_ioctl()
Revert "of/irq: of_irq_find_parent: check for parent equal to child"
drivers/gpu/vga/vgaarb.c: add missing kfree
drm/radeon/kms/atom: unify i2c gpio table handling
drm/radeon/kms: fix up gpio i2c mask bits for r4xx for real
ttm: Don't return the bo reserved on error path
mount_subtree() pointless use-after-free
iio: fix a leak due to improper use of anon_inode_getfd()
...
Since LRW & XTS are selected by serpent-sse2, we don't need these #ifdefs
anymore.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since LRW & XTS are selected by twofish-x86_64-3way, we don't need these
#ifdefs anymore.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When printing the code bytes in show_registers(), the markers around the
byte at the fault address could make the printk() format string look
like a valid log level and facility code. This would prevent this byte
from being printed and result in a spurious newline:
[ 7555.765589] Code: 8b 32 e9 94 00 00 00 81 7d 00 ff 00 00 00 0f 87 96 00 00 00 48 8b 83 c0 00 00 00 44 89 e2 44 89 e6 48 89 df 48 8b 80 d8 02 00 00
[ 7555.765683] 8b 48 28 48 89 d0 81 e2 ff 0f 00 00 48 c1 e8 0c 48 c1 e0 04
Add KERN_CONT where needed, and elsewhere in show_registers() for
consistency.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Link: http://lkml.kernel.org/r/4EEFA7AE.9020407@ladisch.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
x86 jump instruction size is 2 or 5 bytes (near/long jump), not 2 or 6
bytes.
In case a conditional jump is followed by a long jump, conditional jump
target is one byte past the start of target instruction.
Signed-off-by: Markus Kötter <nepenthesdev@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LAPIC related statistics are grouped inside the per-cpu
structure irq_stat, so there is no need for icr_read_retry_count
to be a standalone per-cpu variable.
This patch moves icr_read_retry_count to where it belongs.
Suggested-y: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Added additional debug output that we always seem to add
during power ons to validate firmware operation.
Signed-off-by: Michael Demeter <michael.demeter@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111215223116.10166.50803.stgit@bob.linux.org.uk
[ fixed line breaks, formatting and commit title. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
gcc noticed (when using -Wempty-body) that our use of
lock_cmos() and unlock_cmos() in
arch/x86/include/asm/mach_traps.h is potentially problematic :
arch/x86/include/asm/mach_traps.h:32:15: warning: suggest braces around empty body in an ¡else¢ statement [-Wempty-body]
arch/x86/include/asm/mach_traps.h:40:16: warning: suggest braces around empty body in an ¡else¢ statement [-Wempty-body]
Let's just use the standard 'do {} while (0)' solution. That
shuts up gcc and also prevents future problems if the macros
should end up being used in a similar situation elsewhere.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1112180103130.21784@swampdragon.chaosbits.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If one builds the kernel with -Wempty-body one gets this
warning:
mm/memory.c:3432:46: warning: suggest braces around empty body in an ¡if¢ statement [-Wempty-body]
due to the fact that 'flush_tlb_fix_spurious_fault' is a macro
that can sometimes be defined to nothing.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-mm@kvack.org
Cc: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1112180128070.21784@swampdragon.chaosbits.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mce-inject provides a mechanism to simulate errors so that test
scripts can check for correct operation of the kernel without
requiring any specialized hardware to create rare events.
The existing code can simulate events in normal process context
and also in NMI context - but not in IRQ context. This patch
fills that gap.
Link: https://lkml.org/lkml/2011/12/7/537
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The efi boot stub tries to read the entire initrd in 1 go, however
some efi implementations hang if too much if asked to read too much
data at the same time. After some experimentation I found out that my
asrock p67 board will hang if asked to read chunks of 4MiB, so use a
safe value.
elilo reads in chunks of 16KiB, but since that requires many read
calls I use a value of 1 MiB. hpa suggested adding individual
blacklists for when systems are found where this value causes a crash.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Link: http://lkml.kernel.org/r/4EEB3A02.3090201@gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Doh.. pass the brown paper bags - preferably filled with mince
pies..
This fixes occasional build failures.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-r0oc1knlvzuqr69artaeq8s8@git.kernel.org
[ extended the changelog a bit ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We would include <asm-generic/bitops/fls64.h> even without __KERNEL__,
but that doesn't make sense, as:
1. That file provides fls64(), but the corresponding function fls() is
not exported to user space.
2. The implementation of fls64.h uses kernel-only symbols.
3. fls64.h is not exported to user space.
This appears to have been a bug introduced in checkin:
d57594c203b1 bitops: use __fls for fls64 on 64-bit archs
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Alexander van Heukelum <heukelum@mailshack.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/4EEA77E1.6050009@zytor.com
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/swiotlb: Use page alignment for early buffer allocation.
xen: only limit memory map to maximum reservation for domain 0.
d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. In order to boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for
future memory expansion the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves this correctly if we do not artificially limit the maximum
number of pages for the guest case.
For a guest booted with maxmem=512, memory=128 this results in:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
-[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable)
-[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable)
+[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable)
...
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
-[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000
+[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000
[ 0.000000] initial memory mapped : 0 - 027ff000
[ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096
-[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000
-[ 0.000000] 0000000000 - 0008100000 page 4k
-[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000
+[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000
+[ 0.000000] 0000000000 - 0020800000 page 4k
+[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000
[ 0.000000] xen: setting RW the range 27e8000 - 27ff000
[ 0.000000] 0MB HIGHMEM available.
-[ 0.000000] 129MB LOWMEM available.
-[ 0.000000] mapped low ram: 0 - 08100000
-[ 0.000000] low ram: 0 - 08100000
+[ 0.000000] 520MB LOWMEM available.
+[ 0.000000] mapped low ram: 0 - 20800000
+[ 0.000000] low ram: 0 - 20800000
With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).
There is no change for dom0.
Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
My box with following cpuinfo needs the cx8 enabling still:
vendor_id : CentaurHauls
cpu family : 6
model : 13
model name : VIA Eden Processor 1200MHz
stepping : 0
cpu MHz : 1199.940
cache size : 128 KB
This fixes valgrind to work on my box (it requires and checks
cx8 from cpuinfo).
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Link: http://lkml.kernel.org/r/1323961888-10223-1-git-send-email-timo.teras@iki.fi
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
They had several problems/shortcomings:
Only the first memory operand was mentioned in the 2x32bit asm()
operands, and 2x64-bit version had a memory clobber. The first
allowed the compiler to not recognize the need to re-load the
data in case it had it cached in some register, and the second
was overly destructive.
The memory operand in the 2x32-bit asm() was declared to only be
an output.
The types of the local copies of the old and new values were
incorrect (as in other per-CPU ops, the types of the per-CPU
variables accessed should be used here, to make sure the
respective types are compatible).
The __dummy variable was pointless (and needlessly initialized
in the 2x32-bit case), given that local copies of the inputs
already exist.
The 2x64-bit variant forced the address of the first object into
%rsi, even though this is needed only for the call to the
emulation function. The real cmpxchg16b can operate on an
memory.
At once also change the return value type to what it really is -
'bool'.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4EE86D6502000078000679FE@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the flags to get rid of the [9] and [10] feature names
in cpuinfo's 'power management' fields and replace them with
meaningful names.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1323875574-17881-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Thermal throttle and power limit events are not defined as MCE errors in x86
architecture and should not generate MCE errors in mcelog.
Current kernel generates fake software defined MCE errors for these events.
This may confuse users because they may think the machine has real MCE errors
while actually only thermal throttle or power limit events happen.
To make it worse, buggy firmware on some platforms may falsely generate
the events. Therefore, kernel reports MCE errors which users think as real
hardware errors. Although the firmware bugs should be fixed, on the other hand,
kernel should not report MCE errors either.
So mcelog is not a good mechanism to report these events. To report the events, we count them in respective counters (core_power_limit_count,
package_power_limit_count, core_throttle_count, and package_throttle_count) in
/sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters
for each event on each CPU. Please note that all CPU's on one package report
duplicate counters. It's user application's responsibity to retrieve a package
level counter for one package.
This patch doesn't report package level power limit, core level power limit, and
package level thermal throttle events in mcelog. When the events happen, only
report them in respective counters in sysfs.
Since core level thermal throttle has been legacy code in kernel for a while and
users accepted it as MCE error in mcelog, core level thermal throttle is still
reported in mcelog. In the mean time, the event is counted in a counter in sysfs
as well.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Borislav Petkov <bp@amd64.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20111215001945.GA21009@linux-os.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add a function which drains whatever MCEs were logged in already during
boot and before the decoder chains were registered.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Once we've found and validated the ucode patch for the current CPU,
there's no need to iterate over the remaining patches in the binary
image. Exit then and save us a bunch of cycles.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Basically, what we did until now is take out a chunk of the firmware
image, vmalloc space for it and inspect it before application. And
repeat.
This patch changes all that so that we look at each ucode patch from
the firmware image, check it for sanity and copy it to local buffer for
application only once and if it passes all checks. Thus, vmalloc-ing for
each piece is gone, we can do proper size checking only of the patch
which is destined for the CPU of the current machine instead of each
single patch, which is clearly wrong.
Oh yeah, simplify and cleanup the code while at it, along with adding
comments as to what actually happens.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a simple 4K page which gets allocated on driver init and freed on
driver exit instead of vmalloc'ing small buffers for each ucode patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In the IPI delivery slow path (NMI delivery) we retry the ICR
read to check for delivery completion a limited number of times.
[ The reason for the limited retries is that some of the places
where it is used (cpu boot, kdump, etc) IPI delivery might not
succeed (due to a firmware bug or system crash, for example)
and in such a case it is better to give up and resume
execution of other code. ]
This patch adds a new entry to /proc/interrupts, RTR, which
tells user space the number of times we retried the ICR read in
the IPI delivery slow path.
This should give some insight into how well the APIC
message delivery hardware is working - if the counts are way
too large then we are hitting a (very-) slow path way too
often.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/n/tip-vzsp20lo2xdzh5f70g0eis2s@git.kernel.org
[ extended the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>