3254 Commits

Author SHA1 Message Date
Kevin Wolf
4cee4798a3 KVM: x86 emulator: Allow PM/VM86 switch during task switch
Task switches can switch between Protected Mode and VM86. The current
mode must be updated during the task switch emulation so that the new
segment selectors are interpreted correctly.

In order to let privilege checks succeed, rflags needs to be updated in
the vcpu struct as this causes a CPL update.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:29 +02:00
Kevin Wolf
7f3d35fddd KVM: x86 emulator: Fix task switch privilege checks
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this take gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

[avi: kill kvm-kmod remnants]

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:26 +02:00
Takuya Yoshikawa
db3fe4eb45 KVM: Introduce kvm_memory_slot::arch and move lpage_info into it
Some members of kvm_memory_slot are not used by every architecture.

This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch;  lpage_info is moved into it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:22 +02:00
Zachary Amsden
e26101b116 KVM: Track TSC synchronization in generations
This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:09 +02:00
Zachary Amsden
0dd6a6edb0 KVM: Dont mark TSC unstable due to S4 suspend
During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:08 +02:00
Marcelo Tosatti
f1e2b26003 KVM: Allow adjust_tsc_offset to be in host or guest cycles
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:07 +02:00
Zachary Amsden
6f526ec538 KVM: Add last_host_tsc tracking back to KVM
The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC.  The guest may
set the TSC backwards (this case handled by the previous patch), but
the computation of guest TSC (and fetching an MSR) is significanlty more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation,
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:06 +02:00
Zachary Amsden
5d3cb0f6a8 KVM: Improve TSC offset matching
There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number), the code works much nicer to
do computation in nanosecond units.

Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Fix 64-bit division in kvm_write_tsc()

Breaks i386 build.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:03 +02:00
Zachary Amsden
cc578287e3 KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is the half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:09:35 +02:00
H. Peter Anvin
e7084fd52e x32: Switch to a 64-bit clock_t
clock_t is used mainly to give the number of jiffies a certain process
has burned.  It is entirely feasible for a long-running process to
consume more than 2^32 jiffies especially in a multiprocess system.
As such, switch to a 64-bit clock_t for x32, just as we already
switched to a 64-bit time_t.

clock_t is only used in a handful of places, and as such it is really
not a very significant change.  The one that has the biggest impact is
in struct siginfo, but since the *size* of struct siginfo doesn't
change (it is padded to the hilt) it is fairly easy to make this a
localized change.

This also gets rid of sys_x32_times, however since this is a pretty
late change don't compactify the system call numbers; we can reuse
system call slot 521 next time we need an x32 system call.

Reported-by: Gregory M. Lueck <gregory.m.lueck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
2012-03-05 15:35:18 -08:00
H. Peter Anvin
a628b684d2 x32: Provide separate is_ia32_task() and is_x32_task() predicates
The is_compat_task() test is composed of two predicates already, so
make each of them available separately.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
2012-03-05 15:35:18 -08:00
Stephane Eranian
225ce53910 perf/x86: Add Intel LBR MSR definitions
This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSR:
LBR_SELECT, LBRT_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:39 +01:00
Marcelo Tosatti
a59cb29e4d KVM: x86: increase recommended max vcpus to 160
Increase recommended max vcpus from 64 to 160 (tested internally
at Red Hat).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:57:34 +02:00
Igor Mammedov
df156f90a0 x86: Introduce x86_cpuinit.early_percpu_clock_init hook
When kvm guest uses kvmclock, it may hang on vcpu hot-plug.
This is caused by an overflow in pvclock_get_nsec_offset,

    u64 delta = tsc - shadow->tsc_timestamp;

which in turn is caused by an undefined values from percpu
hv_clock that hasn't been initialized yet.
Uninitialized clock on being booted cpu is accessed from
   start_secondary
    -> smp_callin
      ->  smp_store_cpu_info
        -> identify_secondary_cpu
          -> mtrr_ap_init
            -> mtrr_restore
              -> stop_machine_from_inactive_cpu
                -> queue_stop_cpus_work
                  ...
                    -> sched_clock
                      -> kvm_clock_read
which is well before x86_cpuinit.setup_percpu_clockev call in
start_secondary, where percpu clock is initialized.

This patch introduces a hook that allows to setup/initialize
per_cpu clock early and avoid overflow due to reading
  - undefined values
  - old values if cpu was offlined and then onlined again

Another possible early user of this clock source is ftrace that
accesses it to get timestamps for ring buffer entries. So if
mtrr_ap_init is moved from identify_secondary_cpu to past
x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace
may cause the same overflow/hang on cpu hot-plug anyway.

More complete description of the problem:
  https://lkml.org/lkml/2012/2/2/101

Credits to Marcelo Tosatti <mtosatti@redhat.com> for hook idea.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:57:32 +02:00
Takuya Yoshikawa
3ea8b75e47 KVM: MMU: Remove unused kvm_pte_chain
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:42 +02:00
Christian Borntraeger
b9e5dc8d45 KVM: provide synchronous registers in kvm_run
On some cpus the overhead for virtualization instructions is in the same
range as a system call. Having to call multiple ioctls to get set registers
will make certain userspace handled exits more expensive than necessary.
Lets provide a section in kvm_run that works as a shared save area
for guest registers.
We also provide two 64bit flags fields (architecture specific), that will
specify
1. which parts of these fields are valid.
2. which registers were modified by userspace

Each bit for these flag fields will define a group of registers (like
general purpose) or a single register.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:22 +02:00
Boris Ostrovsky
2b036c6b86 KVM: SVM: Add support for AMD's OSVW feature in guests
In some cases guests should not provide workarounds for errata even when the
physical processor is affected. For example, because of erratum 400 on family
10h processors a Linux guest will read an MSR (resulting in VMEXIT) before
going to idle in order to avoid getting stuck in a non-C0 state. This is not
necessary: HLT and IO instructions are intercepted and therefore there is no
reason for erratum 400 workaround in the guest.

This patch allows us to present a guest with certain errata as fixed,
regardless of the state of actual hardware.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Ingo Molnar
737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Paul Gortmaker
187f1882b5 BUG: headers with BUG/BUG_ON etc. need linux/bug.h
If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
other BUG variant in a static inline (i.e. not in a #define) then
that header really should be including <linux/bug.h> and not just
expecting it to be implicitly present.

We can make this change risk-free, since if the files using these
headers didn't have exposure to linux/bug.h already, they would have
been causing compile failures/warnings.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-03-04 17:54:34 -05:00
Joerg Roedel
1018faa6cf perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.

This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.

The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: stable@vger.kernel.org # v3.2
Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02 12:16:39 +01:00
H. Peter Anvin
b263b31e8a x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
Specify the data structures for the 64-bit ioctls with explicit sizing
and padding so that the x32 kernel will correctly use the 64-bit forms
of these ioctls.  Note that these ioctls are bogus in both forms on
both 32 and 64 bits; even on 64 bits the maximum MTRR size is only 44
bits long.

Note that nothing really is supposed to use these ioctls and that the
preferred interface is text strings on /proc/mtrr, or better yet,
nothing at all (use /sys/bus/pci/devices/*/resource*_wc for write
combining; that uses PAT not MTRRs.)

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Tested-by: Nitin A. Kamble <nitin.a.kamble@intel.com>
Link: http://lkml.kernel.org/n/tip-vwvnlu3hjmtkwvij4qxtm90l@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-03-01 12:48:52 -08:00
Paul Gortmaker
f649e9388c x86: relocate get/set debugreg fcns to include/asm/debugreg.
Since we already have a debugreg.h header file, move the
assoc. get/set functions to it.  In addition to it being the
logical home for them, it has a secondary advantage.  The
functions that are moved use BUG().  So we really need to
have linux/bug.h in scope.  But asm/processor.h is used about
600 times, vs. only about 15 for debugreg.h -- so adding bug.h
to the latter reduces the amount of time we'll be processing
it during a compile.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
2012-02-28 17:48:04 -05:00
Ingo Molnar
458ce2910a Merge branch 'linus' into x86/asm
Sync up the latest NMI fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 10:27:36 +01:00
Jan Beulich
d93c4071b7 x86/time: Eliminate unused irq0_irqs counter
As of v2.6.38 this counter is being maintained without ever being
read.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F4787930200007800074A10@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27 08:46:25 +01:00
Ingo Molnar
c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Olof Johansson
1adbfa3511 x86, efi: Allow basic init with mixed 32/64-bit efi/kernel
Traditionally the kernel has refused to setup EFI at all if there's been
a mismatch in 32/64-bit mode between EFI and the kernel.

On some platforms that boot natively through EFI (Chrome OS being one),
we still need to get at least some of the static data such as memory
configuration out of EFI. Runtime services aren't as critical, and
it's a significant amount of work to implement switching between the
operating modes to call between kernel and firmware for thise cases. So
I'm ignoring it for now.

v5:
* Fixed some printk strings based on feedback
* Renamed 32/64-bit specific types to not have _ prefix
* Fixed bug in printout of efi runtime disablement

v4:
* Some of the earlier cleanup was accidentally reverted by this patch, fixed.
* Reworded some messages to not have to line wrap printk strings

v3:
* Reorganized to a series of patches to make it easier to review, and
  do some of the cleanups I had left out before.

v2:
* Added graceful error handling for 32-bit kernel that gets passed
  EFI data above 4GB.
* Removed some warnings that were missed in first version.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-6-git-send-email-olof@lixom.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:54:51 -08:00
Grant Likely
b4e518547d irq_domain/x86: Convert x86 (embedded) to use common irq_domain
This patch removes the x86-specific definition of irq_domain and replaces
it with the common implementation.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-02-23 14:37:47 -07:00
Greg Kroah-Hartman
d6126ef5f3 x86/mce: Convert static array of pointers to per-cpu variables
When I previously fixed up the mce_device code, I used a static array of
the pointers.  It was (rightfully) pointed out to me that I should be
using the per_cpu code instead.

This patch converts the code over to that structure, moving the variable
back into the per_cpu area, like it used to be for 3.2 and earlier.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: https://lkml.org/lkml/2012/1/27/165
Signed-off-by: Tony Luck <tony.luck@intel.com>
2012-02-22 12:58:06 -08:00
H. Peter Anvin
513c4ec6e4 x86, cpufeature: Add CPU features from Intel document 319433-012A
Add CPU features from the Intel Archicture Instruction Set Extensions
Programming Reference version 012A (Feb 2012), document number 319433-012A.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 17:25:50 -08:00
H. Peter Anvin
22e842d4d9 x32: Fix coding style violations in the x32 VDSO code
Move the prototype for x32_setup_additional_pages() to a header file,
and adjust the coding style to match standard.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
2012-02-21 14:32:19 -08:00
Linus Torvalds
1361b83a13 i387: Split up <asm/i387.h> into exported and internal interfaces
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained.  Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:54 -08:00
Linus Torvalds
8546c00892 i387: Uninline the generic FP helpers that we expose to kernel modules
Instead of exporting the very low-level internals of the FPU state
save/restore code (ie things like 'fpu_owner_task'), we should export
the higher-level interfaces.

Inlining these things is pointless anyway: sure, sometimes the end
result is small, but while 'stts()' can result in just three x86
instructions, those are not cheap instructions (writing %cr0 is a
serializing instruction and a very slow one at that).

So the overhead of a function call is not noticeable, and we really
don't want random modules mucking about with our internal state save
logic anyway.

So this unexports 'fpu_owner_task', and instead uninlines and exports
the actual functions that modules can use: fpu_kernel_begin/end() and
unlazy_fpu().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211339590.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:46 -08:00
H. Peter Anvin
d1a797f388 x32: Handle process creation
Allow an x32 process to be started.

Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
2012-02-20 12:52:05 -08:00
H. Peter Anvin
a96d692e9a x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
Unfortunately a lot of the compat types are guarded with CONFIG_COMPAT
or the equivalent, so add a similar guard to <asm/sys_ia32.h> to avoid
compilation failures when CONFIG_COMPAT=n.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:05 -08:00
H. Peter Anvin
fca460f95e x32: Handle the x32 system call flag
x32 shares most system calls with x86-64, but unfortunately some
subsystem (the input subsystem is the chief offender) which require
is_compat() when operating with a 32-bit userspace.  The input system
actually has text files in sysfs whose meaning is dependent on
sizeof(long) in userspace!

We could solve this by having two completely disjoint system call
tables; requiring that each system call be duplicated.  This patch
takes a different approach: we add a flag to the system call number;
this flag doesn't affect the system call dispatch but requests compat
treatment from affected subsystems for the duration of the system call.

The change of cmpq to cmpl is safe since it immediately follows the
and.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:05 -08:00
H. Peter Anvin
9d3897630e x32: Add rt_sigframe_x32
Add rt_sigframe_x32 to <asm/sigframe.h>.  Unfortunately we can't just
define all the data structures unconditionally, due to the #ifdef
CONFIG_COMPAT in <linux/compat.h> and its trickle-down effects, hence
the #ifdef mess.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:05 -08:00
H. J. Lu
4048e2a8d4 x32: Add struct ucontext_x32
Add a definition for struct ucontext_x32; this is inherently a mix of
the 32- and 64-bit versions.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:04 -08:00
H. Peter Anvin
851394229e x32: Export setup/restore_sigcontext from signal.c
Export setup_sigcontext() and restore_sigcontext() from signal.c, so
we can use the 64-bit versions verbatim for x32.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:04 -08:00
H. Peter Anvin
f28f0c2357 x86: Move some signal-handling definitions to a common header
There are some definitions which are duplicated between
kernel/signal.c and ia32/ia32_signal.c; move them to a common header
file.

Rather than adding stuff to existing header files which contain data
structures, create a new header file; hence the slightly odd name
("all the good ones were taken.")

Note: nothing relied on signal_fault() being defined in
<asm/ptrace.h>.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:04 -08:00
H. Peter Anvin
ea499fec48 x32: Generate <asm/unistd_64_x32.h>
Generate macros for the *kernel* code to use to refer to x32 system
calls.  These have an __NR_x32_ prefix and do not include
__X32_SYSCALL_BIT.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:52:04 -08:00
H. Peter Anvin
6cbb369f57 x32: Generate <asm/unistd_x32.h>
Generate <asm/unistd_x32.h>; this exports x32 system call numbers to
user space.

[ v2: Enclose all arguments to syshdr in '' so empty arguments aren't
      dropped on the floor. ]

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:51:00 -08:00
H. Peter Anvin
2c73ce7346 x86-64, ia32: Drop sys32_rt_sigprocmask
On x86, the only difference between sys_rt_sigprocmask and
sys32_rt_sigprocmask is the alignment of the data structures.
However, x86 allows data accesses with arbitrary alignment, and
therefore there is no reason for this code to be different.

Reported-by: Gregory M. Lueck <gregory.m.lueck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:49 -08:00
H. Peter Anvin
bb2127240c x32: Add a thread flag for x32 processes
An x32 process is *almost* the same thing as a 64-bit process with a
32-bit address limit, but there are a few minor differences -- in
particular core dumps are 32 bits and signal handling is different.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:49 -08:00
H. J. Lu
d046ff8b30 x86-64: Add prototype for old_rsp to a header file
So far this has only been used in process_64.c, but the x32 code will
need it in additional code.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:48 -08:00
H. Peter Anvin
1f5e27a90a x32: Create posix_types_x32.h
This is the same as the 64-bit posix_types.h, except that
__kernel_[u]long_t is defined to be [unsigned] long long and therefore
64 bits.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:47 -08:00
H. Peter Anvin
4f72e331c2 x86-64: Use explicit sizes in sigcontext.h, prepare for x32
Use explicit sizes (__u64) instead of implicit sizes (unsigned long)
in the definition for sigcontext.h; this will allow this structure to
be shared between the x86-64 native ABI and the x32 ABI.

Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-4pr1xnnksprt7t0h3w5fw4rv@git.kernel.org
2012-02-20 12:48:46 -08:00
H. Peter Anvin
6bd330083e x86: Factor out TIF_IA32 from 32-bit address space
Factor out IA32 (compatibility instruction set) from 32-bit address
space in the thread_info flags; this is a precondition patch for x32
support.

Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-4pr1xnnksprt7t0h3w5fw4rv@git.kernel.org
2012-02-20 12:48:46 -08:00
Linus Torvalds
7e16838d94 i387: support lazy restore of FPU state
This makes us recognize when we try to restore FPU state that matches
what we already have in the FPU on this CPU, and avoids the restore
entirely if so.

To do this, we add two new data fields:

 - a percpu 'fpu_owner_task' variable that gets written any time we
   update the "has_fpu" field, and thus acts as a kind of back-pointer
   to the task that owns the CPU.  The exception is when we save the FPU
   state as part of a context switch - if the save can keep the FPU
   state around, we leave the 'fpu_owner_task' variable pointing at the
   task whose FP state still remains on the CPU.

 - a per-thread 'last_cpu' field, that indicates which CPU that thread
   used its FPU on last.  We update this on every context switch
   (writing an invalid CPU number if the last context switch didn't
   leave the FPU in a lazily usable state), so we know that *that*
   thread has done nothing else with the FPU since.

These two fields together can be used when next switching back to the
task to see if the CPU still matches: if 'fpu_owner_task' matches the
task we are switching to, we know that no other task (or kernel FPU
usage) touched the FPU on this CPU in the meantime, and if the current
CPU number matches the 'last_cpu' field, we know that this thread did no
other FP work on any other CPU, so the FPU state on the CPU must match
what was saved on last context switch.

In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
CR0.TS bit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:58:54 -08:00
Linus Torvalds
80ab6f1e8c i387: use 'restore_fpu_checking()' directly in task switching code
This inlines what is usually just a couple of instructions, but more
importantly it also fixes the theoretical error case (can that FPU
restore really ever fail? Maybe we should remove the checking).

We can't start sending signals from within the scheduler, we're much too
deep in the kernel and are holding the runqueue lock etc.  So don't
bother even trying.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:58:28 -08:00
Linus Torvalds
cea20ca3f3 i387: fix up some fpu_counter confusion
This makes sure we clear the FPU usage counter for newly created tasks,
just so that we start off in a known state (for example, don't try to
preload the FPU state on the first task switch etc).

It also fixes a thinko in when we increment the fpu_counter at task
switch time, introduced by commit 34ddc81a230b ("i387: re-introduce FPU
state preloading at context switch time").  We should increment the
*new* task fpu_counter, not the old task, and only if we decide to use
that state (whether lazily or preloaded).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:24:09 -08:00