31838 Commits

Author SHA1 Message Date
Ingo Molnar
9ed8f1a6e7 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28 08:27:17 +01:00
Ingo Molnar
0614621d89 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28 07:50:39 +01:00
Tommi Hirvola
7748168c66 crypto: x86/poly1305 - Clear key material from stack in SSE2 variant
1-block SSE2 variant of poly1305 stores variables s1..s4 containing key
material on the stack. This commit adds missing zeroing of the stack
memory. Benchmarks show negligible performance hit (tested on i7-3770).

Signed-off-by: Tommi Hirvola <tommi@hirvola.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-02-28 14:17:59 +08:00
Thomas Gleixner
cfbe271667 y2038: additional syscall ABI cleanup

Merge tag 'y2038-syscall-abi' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/2038

Pull additional syscall ABI cleanup for y2038 from Arnd Bergmann:

This is a follow-up to the y2038 syscall patches already merged in the tip
tree.  As the final 32-bit RISC-V syscall ABI is still being decided on,
this is the last chance to make a few corrections to leave out interfaces
based on 32-bit time_t along with the old off_t and rlimit types.

The series achieves this in a few steps:

- A couple of bug fixes for minor regressions I introduced
  in the original series

- A couple of older patches from Yury Norov that I had never
  merged in the past, these fix up the openat/open_by_handle_at and
  getrlimit/setrlimit syscalls to disallow the old versions of off_t
  and rlimit.

- Hiding the deprecated system calls behind an #ifdef in
  include/uapi/asm-generic/unistd.h

- Change arch/riscv to drop all these ABIs.

Originally, the plan was to also leave these out on C-Sky, but that now
has a glibc port that uses the older interfaces, so we need to leave
them in place.
2019-02-27 21:45:27 +01:00
Tony Luck
41f035a86b x86/mce: Improve error message when kernel cannot recover, p2
In

  c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")

a case was added for a machine check caused by a DATA access to poison
memory from the kernel. A case should have been added also for an
uncorrectable error during an instruction fetch in the kernel.

Add that extra case so the error message now reads:

  mce: [Hardware Error]: Machine check: Instruction fetch error in kernel

Fixes: c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
2019-02-25 23:21:35 +01:00
Borislav Petkov
2e7614c073 x86/uaccess: Remove unused __addr_ok() macro
This was caught while staring at the whole {set,get}_fs() machinery.

Its last user, the 32-bit version of strnlen_user(), went away with

  5723aa993d83 ("x86: use the new generic strnlen_user() function")

so drop it.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Link: https://lkml.kernel.org/r/20190225191109.7671-1-bp@alien8.de
2019-02-25 23:13:05 +01:00
Andy Lutomirski
2a418cf3f5 x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
When calling __put_user(foo(), ptr), the __put_user() macro would call
foo() in between __uaccess_begin() and __uaccess_end().  If that code
were buggy, then those bugs would be run without SMAP protection.

Fortunately, there seem to be few instances of the problem in the
kernel. Nevertheless, __put_user() should be fixed to avoid doing this.
Therefore, evaluate __put_user()'s argument before setting AC.

This issue was noticed when an objtool hack by Peter Zijlstra complained
about genregs_get() and I compared the assembly output to the C source.

 [ bp: Massage commit message and fixed up whitespace. ]

Fixes: 11f1a4b9755f ("x86: reorganize SMAP handling in user space accesses")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20190225125231.845656645@infradead.org
2019-02-25 20:17:05 +01:00
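The evaluation-order problem described in 2a418cf3f5 can be modelled in userspace. This is only a sketch: `ac_flag`, `uaccess()`, `put_user_buggy()` and `put_user_fixed()` are hypothetical stand-ins for the AC flag, the `__uaccess_begin()`/`__uaccess_end()` pair and the two forms of the macro, not kernel code.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the x86 AC flag; True means SMAP is suspended.
ac_flag = False
observed = []  # AC state seen by each evaluation of the value expression


@contextmanager
def uaccess():
    """Model __uaccess_begin()/__uaccess_end() as a context manager."""
    global ac_flag
    ac_flag = True
    try:
        yield
    finally:
        ac_flag = False


def foo():
    """A value expression that records whether it ran with AC set."""
    observed.append(ac_flag)
    return 42


def put_user_buggy(make_value, dest):
    # Old behaviour: the value expression runs inside the AC window.
    with uaccess():
        dest.append(make_value())


def put_user_fixed(make_value, dest):
    # Fixed behaviour: evaluate first, open the AC window only for the store.
    value = make_value()
    with uaccess():
        dest.append(value)


user_memory = []
put_user_buggy(foo, user_memory)
put_user_fixed(foo, user_memory)
print(observed)  # [True, False]
```

In the buggy variant any bug inside `foo()` would run with the protection lifted; in the fixed variant the window covers the store alone.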
Linus Torvalds
53a41cb7ed Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
This reverts commit 9da3f2b74054406f87dff7101a569217ffceb29b.

It was well-intentioned, but wrong.  Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.

It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.

Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinkled into random
places to hide the wrongness).

The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.

The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".

So let's not break the valid cases to catch invalid cases that shouldn't
even exist.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-25 09:10:51 -08:00
David S. Miller
70f3522614 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.

The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.

However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 12:06:19 -08:00
Alexander Shishkin
c60f83b813 perf, pt, coresight: Fix address filters for vmas with non-zero offset
Currently, the address range calculation for file-based filters works as
long as the vma that maps the matching part of the object file starts
from offset zero into the file (vm_pgoff==0). Otherwise, the resulting
filter range would be off by vm_pgoff pages. Another related problem is
that in case of a partially matching vma, that is, a vma that matches
part of a filter region, the filter range size wouldn't be adjusted.

Fix the arithmetic around address filter range calculations, taking
into account vma offset, so that the entire calculation is done before
the filter configuration is passed to the PMU drivers instead of having
those drivers do the final bit of arithmetic.

Based on the patch by Adrian Hunter <adrian.hunter@intel.com>.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 375637bc5249 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-22 16:52:07 -03:00
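The range calculation that c60f83b813 describes can be sketched as an interval intersection in file-offset space, translated back to addresses. The function name, its parameters and the 4 KiB page size are assumptions made for illustration; this is not the kernel's implementation.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def filter_range(vma_start, vma_size, vm_pgoff, filter_off, filter_size):
    """Return (address, size) of the filter region visible through one vma.

    The filter is given as a byte range in the object file. The vma maps
    the file starting at vm_pgoff pages. Returns None if they do not overlap.
    """
    vma_file_start = vm_pgoff * PAGE_SIZE
    vma_file_end = vma_file_start + vma_size

    # Intersect the filter with the part of the file this vma maps.
    lo = max(filter_off, vma_file_start)
    hi = min(filter_off + filter_size, vma_file_end)
    if lo >= hi:
        return None

    # Translate the file offset back into an address inside the vma.
    return (vma_start + (lo - vma_file_start), hi - lo)


# vma maps file bytes 0x2000..0x5000 at address 0x400000.
print(filter_range(0x400000, 0x3000, 2, 0x2800, 0x1000))
print(filter_range(0x400000, 0x3000, 2, 0x1000, 0x2000))
```

The first call shows the offset correction for a vma with non-zero `vm_pgoff`; the second shows the size being trimmed when the vma matches only part of the filter region.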
Yu Zhang
de3ccd26fa KVM: MMU: record maximum physical address width in kvm_mmu_extended_role
Previously, commit 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow
MMU reconfiguration is needed") offered some optimization to avoid
the unnecessary reconfiguration. Yet one scenario is broken - when
cpuid changes VM's maximum physical address width, reconfiguration
is needed to reset the reserved bits.  Also, the TDP may need to
reset its shadow_root_level when this value is changed.

To fix this, a new field, maxphyaddr, is introduced in the extended
role structure to keep track of the configured guest physical address
width.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22 19:25:10 +01:00
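Why adding maxphyaddr to the role fixes the skipped reconfiguration in de3ccd26fa can be seen in a small model. `ExtendedRole` here holds a hypothetical subset of fields and `Mmu.init()` is an invented helper; only the name maxphyaddr comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtendedRole:
    # Hypothetical subset of fields; the real structure has more.
    cr4_pae: bool
    maxphyaddr: int  # guest physical address width, in bits


class Mmu:
    def __init__(self):
        self.role = None
        self.reconfigurations = 0

    def init(self, new_role):
        """Reconfigure only when the role differs from the cached one."""
        if new_role == self.role:
            return  # optimization: nothing relevant changed
        self.role = new_role
        self.reconfigurations += 1  # e.g. reset the reserved bits


mmu = Mmu()
mmu.init(ExtendedRole(cr4_pae=True, maxphyaddr=46))
mmu.init(ExtendedRole(cr4_pae=True, maxphyaddr=46))  # skipped
mmu.init(ExtendedRole(cr4_pae=True, maxphyaddr=52))  # cpuid changed the width
print(mmu.reconfigurations)  # 2
```

Without the maxphyaddr field the third call would compare equal to the cached role and be skipped, which is the broken scenario the commit describes.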
Yu Zhang
511da98d20 kvm: x86: Return LA57 feature based on hardware capability
Previously, 'commit 372fddf70904 ("x86/mm: Introduce the 'no5lvl' kernel
parameter")' cleared X86_FEATURE_LA57 in boot_cpu_data, if Linux chooses
to not run in 5-level paging mode. Yet boot_cpu_data is queried by
do_cpuid_ent() as the host capability later when creating vcpus, and Qemu
will not be able to detect this feature and create VMs with LA57 feature.

As discussed earlier, VMs can still benefit from extended linear address
width, e.g. to enhance features like ASLR. So we would like to fix this,
by returning the true hardware capability when Qemu queries.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22 19:25:05 +01:00
Vitaly Kuznetsov
ad7dc69aeb x86/kvm/mmu: fix switch between root and guest MMUs
Commit 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") brought one subtle
change: previously, when switching back from L2 to L1, we were resetting
MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from
nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context()
when we re-target vcpu->arch.mmu pointer.
The change itself looks logical: if nested_ept_init_mmu_context() changes
something then nested_ept_uninit_mmu_context() restores it back. There is,
however, one thing: the following call chain:

 nested_vmx_load_cr3()
  kvm_mmu_new_cr3()
    __kvm_mmu_new_cr3()
      fast_cr3_switch()
        cached_root_available()

now happens with MMU hooks pointing to the new MMU (root MMU in our case)
while previously it was happening with the old one. cached_root_available()
tries to stash current root but it is incorrect to read current CR3 with
mmu->get_cr3(), we need to use old_mmu->get_cr3() which in case we're
switching from L2 to L1 is guest_mmu. (BTW, in shadow page tables case this
is a non-issue because we don't switch MMU).

While we could've tried to guess that we're switching between MMUs and call
the right ->get_cr3() from cached_root_available() this seems to be overly
complicated. Instead, just stash the corresponding CR3 when setting
root_hpa and make cached_root_available() use the stashed value.

Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22 19:24:48 +01:00
Marcelo Tosatti
7539b174ae x86: kvmguest: use TSC clocksource if invariant TSC is exposed
The invariant TSC bit has the following meaning:

"The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processor's support for invariant TSC
is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
at a constant rate in all ACPI P-, C-, and T-states. This is the
architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead
of ACPI or HPET timers). TSC reads are much more efficient and do not
incur the overhead associated with a ring transition or access to a
platform resource."

IOW, TSC does not change frequency. In such case, and with
TSC scaling hardware available to handle migration, it is possible
to use the TSC clocksource directly, whose system calls are
faster.

Reduce the rating of kvmclock clocksource to allow TSC clocksource
to be the default if invariant TSC is exposed.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

v2: Use feature bits and tsc_unstable() check (Sean Christopherson)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:52 +01:00
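The effect of 7539b174ae on clocksource selection can be sketched as a rating comparison, where the highest-rated clocksource becomes the default. The numeric ratings below are assumptions chosen for illustration, and both function names are hypothetical.

```python
# Assumed illustrative ratings; the highest rating wins.
TSC_RATING = 300
KVMCLOCK_RATING = 400
KVMCLOCK_LOWERED_RATING = 299  # just below the TSC in this sketch


def kvmclock_rating(invariant_tsc, tsc_unstable):
    """Lower kvmclock's rating only when the TSC is usable directly."""
    if invariant_tsc and not tsc_unstable:
        return KVMCLOCK_LOWERED_RATING
    return KVMCLOCK_RATING


def default_clocksource(invariant_tsc, tsc_unstable):
    ratings = {
        "tsc": TSC_RATING,
        "kvm-clock": kvmclock_rating(invariant_tsc, tsc_unstable),
    }
    return max(ratings, key=ratings.get)


print(default_clocksource(invariant_tsc=True, tsc_unstable=False))   # tsc
print(default_clocksource(invariant_tsc=False, tsc_unstable=False))  # kvm-clock
```

The model does not lower the TSC's own rating when it is unstable; it only shows how reducing kvmclock's rating changes which source is picked.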
Sean Christopherson
8ab3c471ee KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
...via a new helper, __kvm_mmu_zap_all().  An alternative to passing a
'bool mmio_only' would be to pass a callback function to filter the
shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but
zapping all shadow pages is a last resort, i.e. making the helper less
extensible is a feature of sorts.  And the explicit MMIO parameter makes
it easy to preserve the WARN_ON_ONCE() if a restart is triggered when
zapping MMIO sptes.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:49 +01:00
Sean Christopherson
24efe61f69 KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a
quadratic runtime[1], i.e. restarting the spte walk while zapping only
MMIO sptes could result in re-walking large portions of the list over
and over due to the non-MMIO sptes encountered before the restart not
being removed.

At the time, the concern was legitimate as the walk was restarted when
any spte was zapped.  But that is no longer the case as the walk is now
restarted iff one or more children have been zapped, which is necessary
because zapping children makes the active_mmu_pages list unstable.

Furthermore, it should be impossible for an MMIO spte to have children,
i.e. zapping an MMIO spte should never result in zapping children.  In
other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and
so should always execute in linear time.  WARN if this assertion fails.

Although it should never be needed, leave the restart logic in place.
In normal operation, the cost is at worst an extra CMP+Jcc, and if for
some reason the list does become unstable, not restarting would likely
crash KVM, or worse, the kernel.

[1] https://patchwork.kernel.org/patch/10756589/#22452085

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:48 +01:00
Sean Christopherson
83cdb56864 KVM: x86/mmu: Differentiate between nr zapped and list unstable
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information.  1) was at
least one page zapped and 2) has the list of MMU pages become unstable.

In its original incarnation (as kvm_mmu_zap_page()), there was no
return value at all.  Commit 0738541396be ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core").  Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children.  Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.

Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped.  This was not intentional, it was simply a
side effect of how the code was written.

The unintended side effect was then morphed into an actual feature by
commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.

Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect.  Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.

Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter.  Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page().  This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.

Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return
                      the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:48 +01:00
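The split made in 83cdb56864 between "number of pages zapped" and "list became unstable" can be modelled as follows. This is a sketch with invented names; the Python tuple stands in for the boolean return value plus the dedicated count parameter, and the list of strings stands in for active_mmu_pages.

```python
def zap_page(active, page, children):
    """Zap a page and its direct children.

    Returns (list_unstable, nr_zapped). The list is unstable only when
    children were removed, because that touches entries other than the
    current one.
    """
    kids = [c for c in children.get(page, []) if c in active]
    for c in kids:
        active.remove(c)
    active.remove(page)
    return (len(kids) > 0, len(kids) + 1)


def zap_all(active, children):
    """Zap every page, restarting the walk only when the list is unstable."""
    total = 0
    restarts = 0
    while True:
        restarted = False
        for page in list(active):
            unstable, nr = zap_page(active, page, children)
            total += nr
            if unstable:
                restarts += 1
                restarted = True
                break  # the snapshot may now name pages that are gone
        if not restarted:
            return total, restarts


pages = ["a", "b", "c", "d"]
print(zap_all(pages, {"b": ["d"]}))  # (4, 1)
```

Had the walk restarted on any non-zero count, as the commit says happened once the initial page was included in the return value, every single zap in this example would have triggered a restart.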
Sean Christopherson
ea145aacf4 Revert "KVM: MMU: fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1], now that all users of the fast invalidate
mechanism are gone.

This reverts commit 5304b8d37c2a5ebca48330f5e7868d240eafbed1.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:47 +01:00
Sean Christopherson
5d6317ca4e KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
Call cond_resched_lock() when zapping all sptes to reschedule if needed
or to release and reacquire mmu_lock in case of contention.  There is no
need to flush or zap when temporarily dropping mmu_lock as zapping all
sptes is done only when the owning userspace VMM has exited or when the
VM is being destroyed, i.e. there is no interplay with memslots or MMIO
generations to worry about.

Be paranoid and restart the walk if mmu_lock is dropped to avoid any
potential issues with consuming a stale iterator.  The overhead in doing
so is negligible as at worst there will be a few root shadow pages at
the head of the list, i.e. the iterator is essentially the head of the
list already.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:46 +01:00
Sean Christopherson
8a674adc11 KVM: x86/mmu: skip over invalid root pages when zapping all sptes
...to guarantee forward progress.  When zapped, root pages are marked
invalid and moved to the head of the active pages list until they are
explicitly freed.  Theoretically, having unzappable root pages at the
head of the list could prevent kvm_mmu_zap_all() from making forward
progress were a future patch to add a loop restart after processing a
page, e.g. to drop mmu_lock on contention.

Although kvm_mmu_prepare_zap_page() can theoretically take action on
invalid pages, e.g. to zap unsync children, functionally it's not
necessary (root pages will be re-zapped when freed) and practically
speaking the odds of e.g. @unsync or @unsync_children becoming %true
while zapping all pages is basically nil.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:46 +01:00
Sean Christopherson
7390de1e99 Revert "KVM: x86: use the fast way to invalidate all pages"
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
Flushing all shadow entries is only done during VM teardown, i.e.
kvm_arch_flush_shadow_all() is only called when the associated MM struct
is being released or when the VM instance is being freed.

Although the performance of teardown itself isn't critical, KVM should
still voluntarily schedule to play nice with the rest of the kernel;
but that can be done without the fast invalidate mechanism in a future
patch.

This reverts commit 6ca18b6950f8dee29361722f28f69847724b276f.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:45 +01:00
Sean Christopherson
b59c4830ca Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert of all patches from the series that introduced the
mechanism[1].

This reverts commit 2248b023219251908aedda0621251cffc548f258.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:44 +01:00
Sean Christopherson
42560fb1f3 Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert of all patches from the series that introduced the
mechanism[1].

This reverts commit 35006126f024f68727c67001b9cb703c38f69268.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:44 +01:00
Sean Christopherson
43d2b14b10 Revert "KVM: MMU: zap pages in batch"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert of all patches from the series that introduced the mechanism[1].

This reverts commit e7d11c7a894986a13817c1c001e1e7668c5c4eb4.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:43 +01:00
Sean Christopherson
210f494261 Revert "KVM: MMU: collapse TLB flushes when zap all pages"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert of all patches from the series that introduced the mechanism[1].

This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:42 +01:00
Sean Christopherson
52d5dedc79 Revert "KVM: MMU: reclaim the zapped-obsolete page first"
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert of all patches from the series that introduced the mechanism[1].

This reverts commit 365c886860c4ba670d245e762b23987c912c129a.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:42 +01:00
Sean Christopherson
5ff0568374 KVM: x86/mmu: Remove is_obsolete() call
Unwinding usage of is_obsolete() is a step towards removing x86's fast
invalidate mechanism, i.e. this is one part of a revert of all patches from
the series that introduced the mechanism[1].

This is a partial revert of commit 05988d728dcd ("KVM: MMU: reduce
KVM_REQ_MMU_RELOAD when root page is zapped").

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:41 +01:00
Sean Christopherson
571c5af06e KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
release and reacquire mmu_lock in case of contention.  There is no need
to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
is done when holding the memslots lock and with the "update in-progress"
bit set in the memslots generation, which disables MMIO spte caching.
The walk does need to be restarted if mmu_lock is dropped as the active
pages list may be modified.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:40 +01:00
Sean Christopherson
4771450c34 Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
Revert back to a dedicated (and slower) mechanism for handling the
scenario where all MMIO shadow PTEs need to be zapped due to overflowing
the MMIO generation number.  The MMIO generation scenario is almost
literally a one-in-a-million occurrence, i.e. is not a performance
sensitive scenario.

Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
the fast invalidate mechanism altogether.

This reverts commit a8eca9dcc656a405a28ffba43f3d86a1ff0eb331.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:40 +01:00
Sean Christopherson
a592a3b8fc Revert "KVM: MMU: document fast invalidate all pages"
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1].

Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. as part of reading PCI ROM, caused KVM to
flush all shadow entries[1].  Now that the memslot case flushes only
shadow entries belonging to the memslot, i.e. doesn't use the fast
invalidate mechanism, the only remaining usages of the mechanism are
when the VM is being destroyed and when the MMIO generation rolls
over.

When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.

The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.

Given that lock-breaking is not desirable (VM teardown) or irrelevant
(MMIO generation overflow), remove the fast invalidate mechanism to
simplify the code (a small amount) and to discourage future code from
zapping all pages as using such a big hammer should be a last resort.

This reverts commit f6f8adeef542a18b1cb26a0b772c9781a10bb477.

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:39 +01:00
Sean Christopherson
4e103134b8 KVM: x86/mmu: Zap only the relevant pages when removing a memslot
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
that actually belong to the memslot being removed.  This improves
performance, especially when the deleted memslot has only a few shadow
entries, or even no entries.  E.g. a microbenchmark to access regular
memory while concurrently reading PCI ROM to trigger memslot deletion
showed a 5% improvement in throughput.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:39 +01:00
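The idea behind 4e103134b8, zapping only what belongs to the removed memslot, can be sketched as a range filter over guest frame numbers. The dictionary of shadow pages and the function name are hypothetical, and the sketch assumes a page "belongs" to the memslot when its gfn falls inside the slot's range.

```python
def zap_pages_in_memslot(active_pages, base_gfn, npages):
    """Remove pages whose gfn lies in the memslot; return how many were zapped.

    active_pages maps a page name to its guest frame number (gfn).
    """
    end_gfn = base_gfn + npages
    doomed = [name for name, gfn in active_pages.items()
              if base_gfn <= gfn < end_gfn]
    for name in doomed:
        del active_pages[name]
    return len(doomed)


pages = {"sp0": 0x100, "sp1": 0x1FF, "sp2": 0x200, "sp3": 0x50}
print(zap_pages_in_memslot(pages, base_gfn=0x100, npages=0x100))  # 2
print(sorted(pages))  # ['sp2', 'sp3']
```

A memslot with no shadow entries costs nothing beyond the scan and leaves every other page in place.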
Sean Christopherson
a21136345c KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()
...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
not require a vcpu so that the code can be (re)used by
kvm_mmu_invalidate_zap_pages_in_memslot().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:38 +01:00
Sean Christopherson
85875a133e KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
helpers in future patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:37 +01:00
Sean Christopherson
164bf7e56c KVM: Move the memslot update in-progress flag to bit 63
...now that KVM won't explode by moving it out of bit 0.  Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.
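
As a minimal sketch of why a high-bit flag is simpler, assuming a
64-bit generation whose top bit is reserved for the flag (the macro and
helper names below are hypothetical, not KVM's), the counter can be
advanced by a plain increment with no special handling of bit 0:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name, for illustration only: the flag lives in bit 63. */
#define GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* True while a memslot update is marked as in progress. */
static inline bool gen_update_in_progress(uint64_t gen)
{
	return (gen & GEN_UPDATE_IN_PROGRESS) != 0;
}

/* Mark the start of an update without disturbing the counter bits. */
static inline uint64_t gen_begin_update(uint64_t gen)
{
	return gen | GEN_UPDATE_IN_PROGRESS;
}

/* Clear the flag and advance the counter; bit 0 is an ordinary counter bit. */
static inline uint64_t gen_end_update(uint64_t gen)
{
	return (gen & ~GEN_UPDATE_IN_PROGRESS) + 1;
}
```

With this layout, beginning and ending one update takes generation 5
to generation 6, rather than stepping through an odd "bogus" value.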

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:37 +01:00
Sean Christopherson
cae7ed3c2c KVM: x86: Refactor the MMIO SPTE generation handling
The code to propagate the memslots generation number into MMIO sptes is
a bit convoluted.  The "what" is relatively straightforward, e.g. the
comment explaining which bits go where is quite readable, but the "how"
requires a lot of staring to understand what is happening.  For example,
'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.

Refactor the code to:

  - use #defines whose values align with the bits defined in the comment
  - use consistent code for both the high and low mask
  - explicitly highlight the handling of bit 0 (update in-progress flag)
  - explicitly call out that the defines are for MMIO sptes (to avoid
    confusion with the per-vCPU MMIO cache, which uses the full memslots
    generation)

In addition to making the code a little less magical, this paves the way
for moving the update in-progress flag to bit 63 without having to
simultaneously rewrite all of the MMIO spte code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:35 +01:00
Sean Christopherson
5192f9b976 KVM: x86: Use a u64 when passing the MMIO gen around
KVM currently uses an 'unsigned int' for the MMIO generation number
despite it being derived from the 64-bit memslots generation and
being propagated to (potentially) 64-bit sptes.  There is no hidden
agenda behind using an 'unsigned int'; it's done simply because the
MMIO generation will never set bits above bit 19.

Passing a u64 will allow the "update in-progress" flag to be relocated
from bit 0 to bit 63 and removes the need to cast the generation back
to a u64 when propagating it to a spte.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:35 +01:00
Sean Christopherson
361209e054 KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing.  Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...

Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.

Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag.  This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of an awkward extension to the generation
number.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:34 +01:00
Sean Christopherson
ddfd1730fd KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
When installing new memslots, KVM sets bit 0 of the generation number to
indicate that an update is in-progress.  Until the update is complete,
there are no guarantees as to whether a vCPU will see the old or the new
memslots.  Explicitly prevent caching MMIO accesses so as to avoid using
an access cached from the old memslots after the new memslots have been
installed.

Note that it is unclear whether or not disabling caching during the
update window is strictly necessary as there is no definitive
documentation as to what ordering guarantees KVM provides with respect
to updating memslots.  That being said, the MMIO spte code does not
allow reusing sptes created while an update is in-progress, and the
associated documentation explicitly states:

    We do not want to use an MMIO sptes created with an odd generation
    number, ...  If KVM is unlucky and creates an MMIO spte while the
    low bit is 1, the next access to the spte will always be a cache miss.

At the very least, disabling the per-vCPU MMIO cache during updates will
make its behavior consistent with the MMIO spte behavior and
documentation.

Fixes: 56f17dd3fbc4 ("kvm: x86: fix stale mmio cache bug")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:33 +01:00
Sean Christopherson
e1359e2beb KVM: x86/mmu: Detect MMIO generation wrap in any address space
The check to detect a wrap of the MMIO generation explicitly looks for a
generation number of zero.  Now that unique memslots generation numbers
are assigned to each address space, only address space 0 will get a
generation number of exactly zero when wrapping.  E.g. when address
space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
wrap to 0x2.  Adjust the MMIO generation to strip the address space
modifier prior to checking for a wrap.
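
The example above can be sketched as follows.  The layout is an
assumption inferred from the numbers in this message (a 19-bit MMIO
generation, with bit 0 as the in-progress flag and bit 1 as the address
space modifier for two address spaces); the macro and function names
are hypothetical and this is not the actual KVM code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define MMIO_GEN_MASK_SKETCH  UINT64_C(0x7ffff) /* 19 bits kept for MMIO */
#define GEN_LOW_MODIFIER_BITS UINT64_C(0x3)     /* bit 0: flag, bit 1: address space */

/* Old-style check: only an MMIO generation of exactly zero counts as a wrap. */
static inline bool mmio_gen_wrapped_naive(uint64_t memslots_gen)
{
	return (memslots_gen & MMIO_GEN_MASK_SKETCH) == 0;
}

/* Strip the low modifier bits first so a wrap in any address space is seen. */
static inline bool mmio_gen_wrapped(uint64_t memslots_gen)
{
	uint64_t gen = memslots_gen & MMIO_GEN_MASK_SKETCH;

	gen &= ~GEN_LOW_MODIFIER_BITS;
	return gen == 0;
}
```

Under these assumptions, 0x80002 is missed by the naive check (its
masked value is 0x2) but is caught once the modifier is stripped.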

Fixes: 4bd518f1598d ("KVM: use separate generations for each address space")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:33 +01:00
Sean Christopherson
152482580a KVM: Call kvm_arch_memslots_updated() before updating memslots
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:32 +01:00
Ben Gardon
4183683918 kvm: vmx: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Ben Gardon
1ec696470c kvm: svm: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Ben Gardon
254272ce65 kvm: x86: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

There remain a few allocations which should be charged to the VM's
cgroup but are not. In x86, they include:
	vcpu->arch.pio_data
These allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:30 +01:00
Paolo Bonzini
359a6c3ddc KVM: nVMX: do not start the preemption timer hrtimer unnecessarily
The preemption timer can be started even if there is a vmentry
failure during or after loading guest state.  That is pointless,
move the call after all conditions have been checked.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:29 +01:00
Yu Zhang
d92935979a kvm: vmx: Fix typos in vmentry/vmexit control setting
Previously, commit f99e3daf94ff ("KVM: x86: Add Intel PT
virtualization work mode") added a framework to support
Intel PT virtualization. However, that patch has some typos
in vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), e.g. it uses the
wrong flags and the wrong variable, which will cause VM entry
to fail later.

Fixes: 'commit f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode")'
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:28 +01:00
Paolo Bonzini
b4b65b5642 KVM: x86: cleanup freeing of nested state
Ensure that the VCPU free path goes through vmx_leave_nested and
thus nested_vmx_vmexit, so that the cancellation of the timer does
not have to be in free_nested.  In addition, because some paths through
nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
the timer is moved there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Luwei Kang
81b016676e KVM: x86: Sync the pending Posted-Interrupts
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.

The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state
but not running on a physical CPU). If a posted interrupt arrives at
this time, the IRQ remapping facility will set the corresponding bit in
the PIR (Posted Interrupt Requests) without setting ON (Outstanding
Notification). As a result, the interrupt is not synced to the APIC
virtualization registers and will not be handled by the guest, because
ON is zero.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Eliminate the pi_clear_sn fast path. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Liu Jingqi
c029b5deb0 KVM: x86: expose MOVDIR64B CPU feature into VM.
MOVDIR64B moves 64 bytes as a direct store with 64-byte write atomicity.
Direct store is implemented by using write combining (WC) for writing
data directly into memory without caching the data.

Availability of the MOVDIR64B instruction is indicated by the presence
of the CPUID feature flag MOVDIR64B (CPUID.0x07.0x0:ECX[bit 28]).
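
As a small sketch of that check, the helper below tests bit 28 of an
ECX value that is assumed to have already been read from CPUID leaf
0x07, subleaf 0x0.  The macro and function names are hypothetical, and
executing CPUID itself is deliberately left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from the text above: CPUID.0x07.0x0:ECX[bit 28]. */
#define CPUID_7_0_ECX_MOVDIR64B_BIT 28

/*
 * Returns true if the given ECX value (as returned by CPUID leaf 0x07,
 * subleaf 0x0) advertises MOVDIR64B.
 */
static inline bool ecx_has_movdir64b(uint32_t ecx)
{
	return ((ecx >> CPUID_7_0_ECX_MOVDIR64B_BIT) & 1u) != 0;
}
```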

This patch exposes the movdir64b feature to the guest.

The reference document is available at the link below:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:26 +01:00
Liu Jingqi
74f2370bb6 KVM: x86: expose MOVDIRI CPU feature into VM.
MOVDIRI moves a doubleword or quadword from a register to memory
through a direct store, which is implemented by using write combining
(WC) for writing data directly into memory without caching the data.

Availability of the MOVDIRI instruction is indicated by the presence of
the CPUID feature flag MOVDIRI (CPUID.0x07.0x0:ECX[bit 27]).

This patch exposes the movdiri feature to the guest.

The reference document is available at the link below:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:26 +01:00
Kai Huang
8acc0993e3 kvm, x86, mmu: Use kernel generic dynamic physical address mask
AMD's SME/SEV is no longer the only case which reduces supported
physical address bits, since Intel introduced Multi-key Total Memory
Encryption (MKTME), which repurposes high bits of the physical address
as a keyID, thus effectively shrinking the supported physical address
bits. To cover both cases (and potential similar future features), the
kernel MM introduced a generic dynamic physical address mask instead of
the hard-coded __PHYSICAL_MASK in commit 94d49eb30e854 ("x86/mm:
Decouple dynamic __PHYSICAL_MASK from AMD SME"). KVM should use that too.

Change PT64_BASE_ADDR_MASK to use the kernel's dynamic physical address
mask when it is enabled, instead of sme_clr. PT64_DIR_BASE_ADDR_MASK is
also deleted since it is not used at all.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:25 +01:00