41150 Commits

Author SHA1 Message Date
Paolo Bonzini
6cd88243c7 KVM: x86: do not report a vCPU as preempted outside instruction boundaries
If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access.  A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.

To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary.  A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.

It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed.  However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses.  Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:21:07 -04:00
Paolo Bonzini
54aa83c901 KVM: x86: do not set st->preempted when going back to user space
Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted.  The reason for KVM's behavior is that for example
optimistic spinning might not be a good idea if the guest is doing repeated
exits to userspace; however, it is confusing and unlikely to make a difference,
because well-tuned guests will hardly ever exit KVM_RUN in the first place.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:21:06 -04:00
Maxim Levitsky
11d39e8cc4 KVM: SVM: fix tsc scaling cache logic
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.

Commit 1ab9287add5e2
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.

Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.

This fixes the following scenario:

1. A CPU runs a guest with some tsc scaling ratio.

2. New guest with different tsc scaling ratio starts on this CPU
   and terminates almost immediately.

   This ensures that the short running guest had set the tsc scaling ratio just
   once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
   the per-cpu cache is not updated.

3. The original guest continues to run, it doesn't restore the msr
   value back to its own value, because the cache matches,
   and thus continues to run with a wrong tsc scaling ratio.

Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:50 -04:00
Ben Gardon
5ba7c4c6d1 KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
the shadow MMU.

When disabling dirty logging, zap non-leaf parent entries to allow
replacement with huge pages instead of recursing and zapping all of the
child, leaf entries. This reduces the number of TLB flushes required.
and reduces the disable dirty log time with the TDP MMU to ~3 seconds.

Opportunistically add a WARN() to catch GFNs that are mapped at a
higher level than their max level.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220525230904.1584480-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:49 -04:00
Jan Beulich
1df931d95f x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs
and clobbering of "cc" don't work well together. The compiler appears to
mean to reject such, but doesn't - in its upstream form - quite manage
to yet for "cc". Furthermore two similar macros don't clobber "cc", and
clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler
always assumes status flags to be clobbered there.

Fixes: 989b5db215a2 ("x86/uaccess: Implement macros for CMPXCHG on user addresses")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:49 -04:00
Shaoqin Huang
cf4a8693d9 KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
When freeing obsolete previous roots, check prev_roots as intended, not
the current root.

Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com>
Fixes: 527d5cd7eece ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped")
Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:48 -04:00
Yanfei Xu
ffd1925a59 KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
When kernel handles the vm-exit caused by external interrupts and NMI,
it always sets kvm_intr_type to tell if it's dealing an IRQ or NMI. For
the PMI scenario, it could be IRQ or NMI.

However, intel_pt PMIs are only generated for HARDWARE perf events, and
HARDWARE events are always configured to generate NMIs.  Use
kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI
came from the guest; this avoids false positives if an intel_pt PMI/NMI
arrives while the host is handling an unrelated IRQ VM-Exit.

Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI")
Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
Message-Id: <20220523140821.1345605-1-yanfei.xu@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:18:27 -04:00
Paolo Bonzini
baec4f5a01 x86, kvm: use correct GFP flags for preemption disabled
Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of
raw spinlock") leads to the following Smatch static checker warning:

	arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake()
	warn: sleeping in atomic context

arch/x86/kernel/kvm.c
    202         raw_spin_lock(&b->lock);
    203         n = _find_apf_task(b, token);
    204         if (!n) {
    205                 /*
    206                  * Async #PF not yet handled, add a dummy entry for the token.
    207                  * Allocating the token must be down outside of the raw lock
    208                  * as the allocator is preemptible on PREEMPT_RT kernels.
    209                  */
    210                 if (!dummy) {
    211                         raw_spin_unlock(&b->lock);
--> 212                         dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
                                                                ^^^^^^^^^^
Smatch thinks the caller has preempt disabled.  The `smdb.py preempt
kvm_async_pf_task_wake` output call tree is:

sysvec_kvm_asyncpf_interrupt() <- disables preempt
-> __sysvec_kvm_asyncpf_interrupt()
   -> kvm_async_pf_task_wake()

The caller is this:

arch/x86/kernel/kvm.c
   290        DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
   291        {
   292                struct pt_regs *old_regs = set_irq_regs(regs);
   293                u32 token;
   294
   295                ack_APIC_irq();
   296
   297                inc_irq_stat(irq_hv_callback_count);
   298
   299                if (__this_cpu_read(apf_reason.enabled)) {
   300                        token = __this_cpu_read(apf_reason.token);
   301                        kvm_async_pf_task_wake(token);
   302                        __this_cpu_write(apf_reason.token, 0);
   303                        wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
   304                }
   305
   306                set_irq_regs(old_regs);
   307        }

The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function
from the call_on_irqstack_cond().  It's inside the call_on_irqstack_cond()
where preempt is disabled (unless it's already disabled).  The
irq_enter/exit_rcu() functions disable/enable preempt.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:13:40 -04:00
Wanpeng Li
619f51da09 KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
The timer is disarmed when switching between TSC deadline and other modes;
however, the pending timer is still in-flight, so let's accurately remove
any traces of the previous mode.

Fixes: 4427593258 ("KVM: x86: thoroughly disarm LAPIC timer around TSC deadline switch")
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:12:35 -04:00
Sean Christopherson
0547758a6d x86/kvm: Alloc dummy async #PF token outside of raw spinlock
Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the
the dummy async #PF token, the allocator is preemptible on PREEMPT_RT
kernels and must not be called from truly atomic contexts.

Opportunistically document why it's ok to loop on allocation failure,
i.e. why the function won't get stuck in an infinite loop.

Reported-by: Yajun Deng <yajun.deng@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:12:34 -04:00
Sean Christopherson
fee060cd52 KVM: x86: avoid calling x86 emulator without a decoded instruction
Whenever x86_decode_emulated_instruction() detects a breakpoint, it
returns the value that kvm_vcpu_check_breakpoint() writes into its
pass-by-reference second argument.  Unfortunately this is completely
bogus because the expected outcome of x86_decode_emulated_instruction
is an EMULATION_* value.

Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
and x86_emulate_instruction() is called without having decoded the
instruction.  This causes various havoc from running with a stale
emulation context.

The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
before commit 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction
emulation with decoding") introduced x86_decode_emulated_instruction().
The other caller of the function does not need breakpoint checks,
because it is invoked as part of a vmexit and the processor has already
checked those before executing the instruction that #GP'd.

This fixes CVE-2022-1852.

Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Fixes: 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311032801.3467418-2-seanjc@google.com>
[Rewrote commit message according to Qiuhao's report, since a patch
 already existed to fix the bug. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:12:05 -04:00
Ashish Kalra
d22d2474e3 KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
For some sev ioctl interfaces, the length parameter that is passed maybe
less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
that PSP firmware returns. In this case, kmalloc will allocate memory
that is the size of the input rather than the size of the data.
Since PSP firmware doesn't fully overwrite the allocated buffer, these
sev ioctl interface may return uninitialized kernel slab memory.

Reported-by: Andy Nguyen <theflow@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Peter Gonda <pgonda@google.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: eaf78265a4ab3 ("KVM: SVM: Move SEV code to separate file")
Fixes: 2c07ded06427d ("KVM: SVM: add support for SEV attestation command")
Fixes: 4cfdd47d6d95a ("KVM: SVM: Add KVM_SEV SEND_START command")
Fixes: d3d1af85e2c75 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: eba04b20e4861 ("KVM: x86: Account a variety of miscellaneous allocations")
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:11:51 -04:00
Sean Christopherson
d187ba5312 x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave',
i.e. to KVM's historical uABI size.  When saving FPU state for usersapce,
KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if
the host doesn't support XSAVE.  Setting the XSAVE header allows the VM
to be migrated to a host that does support XSAVE without the new host
having to handle FPU state that may or may not be compatible with XSAVE.

Setting the uABI size to the host's default size results in out-of-bounds
writes (setting the FP+SSE bits) and data corruption (that is thankfully
caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.

WARN if the default size is larger than KVM's historical uABI size; all
features that can push the FPU size beyond the historical size must be
opt-in.

  ==================================================================
  BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
  Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
  CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
  Hardware name:  /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x45
   print_report.cold+0x45/0x575
   kasan_report+0x9b/0xd0
   fpu_copy_uabi_to_guest_fpstate+0x86/0x130
   kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
   kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
   __x64_sys_ioctl+0x5de/0xc90
   do_syscall_64+0x31/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  Allocated by task 0:
  (stack is not available)
  The buggy address belongs to the object at ffff888011e33800
   which belongs to the cache kmalloc-512 of size 512
  The buggy address is located 0 bytes to the right of
   512-byte region [ffff888011e33800, ffff888011e33a00)
  The buggy address belongs to the physical page:
  page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
  head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
  flags: 0x4000000000010200(slab|head|zone=1)
  raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
  raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  Memory state around the buggy address:
   ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                     ^
   ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ==================================================================
  Disabling lock debugging due to kernel taint

Fixes: be50b2065dfa ("kvm: x86: Add support for getting/setting expanded xstate buffer")
Fixes: c60427dd50ba ("x86/fpu: Add uabi_size to guest_fpu")
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
Message-Id: <20220504001219.983513-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:11:37 -04:00
Paolo Bonzini
b699da3dc2 KVM/riscv changes for 5.19
- Added Sv57x4 support for G-stage page table
 - Added range based local HFENCE functions
 - Added remote HFENCE functions based on VCPU requests
 - Added ISA extension registers in ONE_REG interface
 - Updated KVM RISC-V maintainers entry to cover selftests support
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKHGu8ACgkQrUjsVaLH
 LAe1sQ/40ltbl/v0cW+zkuUOem+apmJMhtoCfh2Pv00yUYftUNw01Uu+NN04T70x
 PYwbu0O8j4dgIFNRPU7VQBVI+fJydkgEr3kpk8UOCCGKiE0NAcFoQv70ngPObc4W
 L425i2RviZuQUXLTFsoLOb246p8V8lkfbEQKqWksFEROYWFbdNKmaLpfVqq3Bia2
 +G8L2OyAHGjUXgIdOnflZHxowJg4ueGob3iH+4AhZNUpIQYtlKSfi/eo0vmzf5Uz
 bD35o6y4G7NnZJyZoKb3QAEt0WQ55YDsNN62XrULQ7GEuWnpez+Jhw3jtrAr59Q7
 m8n93NMKKJ9CbnsspFJ+4nHCd2Gb4i99Py70IW6Ro22DL8KRrLDv2ZQi3dJCGrAT
 MtER+12coglkgjhDmLn6MMEjWkgbXXxQCEs4OQ8VMORtHAsOQEszu5TCEnihXr2q
 +uUZ5O0G6eDowctOVMTdqVMtj1u1AT7fZ68evvk4omNnoFWjkQzd4sVPNDJtK+nC
 7mA9IUyC2LSvr/oNNpcuIZsKU6OzQUQ5ISTMpbP/HJInFcvYbJTl0I8UcvjzlImo
 81CZTUQOY9kQE+VUTHcGqPr0TjN/YlfF//koiCfeTycN0jbRZZ9rpcRQ38R8sDsS
 yy7JQqwpi/x8me9ldt5r19ky5zMlCKpnQfGX6ws+umhqVEHBKw==
 =Xznv
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-5.19-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 5.19

- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
2022-05-25 05:09:49 -04:00
Paolo Bonzini
47e8eec832 KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
 
 - Guard pages for the EL2 stacks
 
 - Trap and emulate AArch32 ID registers to hide unsupported features
 
 - Ability to select and save/restore the set of hypercalls exposed
   to the guest
 
 - Support for PSCI-initiated suspend in collaboration with userspace
 
 - GICv3 register-based LPI invalidation support
 
 - Move host PMU event merging into the vcpu data structure
 
 - GICv3 ITS save/restore fixes
 
 - The usual set of small-scale cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmKGAGsPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDB/gQAMhyZ+wCG0OMEZhwFF6iDfxVEX2Kw8L41NtD
 a/e6LDWuIOGihItpRkYROc5myG74D7XckF2Bz3G7HJoU4vhwHOV/XulE26GFizoC
 O1GVRekeSUY81wgS1yfo0jojLupBkTjiq3SjTHoDP7GmCM0qDPBtA0QlMRzd2bMs
 Kx0+UUXZUHFSTXc7Lp4vqNH+tMp7se+yRx7hxm6PCM5zG+XYJjLxnsZ0qpchObgU
 7f6YFojsLUs1SexgiUqJ1RChVQ+FkgICh5HyzORvGtHNNzK6D2sIbsW6nqMGAMql
 Kr3A5O/VOkCztSYnLxaa76/HqD21mvUrXvr3grhabNc7rOmuzWV0dDgr6c6wHKHb
 uNCtH4d7Ra06gUrEOrfsgLOLn0Zqik89y6aIlMsnTudMg9gMNgFHy1jz4LM7vMkY
 FS5AVj059heg2uJcfgTvzzcqneyuBLBmF3dS4coowO6oaj8SycpaEmP5e89zkPMI
 1kk8d0e6RmXuCh/2AJ8GxxnKvBPgqp2mMKXOCJ8j4AmHEDX/CKpEBBqIWLKkplUU
 8DGiOWJUtRZJg398dUeIpiVLoXJthMODjAnkKkuhiFcQbXomlwgg7YSnNAz6TRED
 Z7KR2leC247kapHnnagf02q2wED8pBeyrxbQPNdrHtSJ9Usm4nTkY443HgVTJW3s
 aTwPZAQ7
 =mh7W
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 5.19

- Add support for the ARMv8.6 WFxT extension

- Guard pages for the EL2 stacks

- Trap and emulate AArch32 ID registers to hide unsupported features

- Ability to select and save/restore the set of hypercalls exposed
  to the guest

- Support for PSCI-initiated suspend in collaboration with userspace

- GICv3 register-based LPI invalidation support

- Move host PMU event merging into the vcpu data structure

- GICv3 ITS save/restore fixes

- The usual set of small-scale cleanups and fixes

[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
 from 4 to 6. - Paolo]
2022-05-25 05:09:23 -04:00
Wanpeng Li
e0ac535178 KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
In commit ec0671d5684a ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
was moved after guest_exit_irqoff() because invoking tracepoints within
kvm_guest_enter/kvm_guest_exit caused a lockdep splat.

These days this is not necessary, because commit 87fa7f3e98a1 ("x86/kvm:
Move context tracking where it belongs", 2020-07-09) restricted
the RCU extended quiescent state to be closer to vmentry/vmexit.
Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
because it will be reported even if vcpu_enter_guest causes multiple
vmentries via the IPI/Timer fast paths, and it allows the removal of
advance_expire_delta.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:07:09 -04:00
Adrian-Ken Rueegsegger
280abe14b6 x86/mm: Fix marking of unused sub-pmd ranges
The unused part precedes the new range spanned by the start, end parameters
of vmemmap_use_new_sub_pmd(). This means it actually goes from
ALIGN_DOWN(start, PMD_SIZE) up to start.

Use the correct address when applying the mark using memset.

Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220509090637.24152-2-ken@codelabs.ch
2022-05-13 12:41:21 +02:00
Vipin Sharma
6ba1e04fa6 KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps
Avoid calling handlers on empty rmap entries and skip to the next non
empty rmap entry.

Empty rmap entries are noop in handlers.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220502220347.174664-1-vipinsh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:45 -04:00
Kai Huang
3c5c32457d KVM: VMX: Include MKTME KeyID bits in shadow_zero_check
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set to SPTE.  Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits to include them to shadow_zero_check.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:45 -04:00
Kai Huang
e54f1ff244 KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
high bits of physical address bits as 'KeyID' bits.  Intel Trust Domain
Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits.  TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode.  And unlike to AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either.  Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.

KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory.  MKTME KeyID bits should be set
to shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check.  So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).

Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:

 - shadow_me_value: the memory encryption bit(s) that will be set to the
   SPTE (the original shadow_me_mask).
 - shadow_me_mask: all possible memory encryption bits (which is a super
   set of shadow_me_value).
 - For now, shadow_me_value is supposed to be set by SVM and VMX
   respectively, and it is a constant during KVM's life time.  This
   perhaps doesn't fit MKTME but for now host kernel doesn't support it
   (and perhaps will never do).
 - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
   in shadow_me_value.

Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed.  This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().

Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:44 -04:00
Kai Huang
c919e881ba KVM: x86/mmu: Rename reset_rsvds_bits_mask()
Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
it clearer that it resets the reserved bits check for guest's page table
entries.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:44 -04:00
Paolo Bonzini
c9f3d9fbcd KVM: x86: a vCPU with a pending triple fault is runnable
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:44 -04:00
Sean Christopherson
1075d41efd KVM: x86/mmu: Expand and clean up page fault stats
Expand and clean up the page fault stats.  The current stats are at best
incomplete, and at worst misleading.  Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.

Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:43 -04:00
Sean Christopherson
8d5265b101 KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults.  The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).

No functional or binary change intented.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:43 -04:00
Sean Christopherson
8a009d5bca KVM: x86/mmu: Make all page fault handlers internal to the MMU
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs.  This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:42 -04:00
Sean Christopherson
5276c616ab KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing.  Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.

E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.

More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:42 -04:00
Sean Christopherson
5c64aba517 KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result.  No sane hardware will generate an access
that is both an instruct fetch and a write, i.e. it's a waste of cycles.
If hardware goes off the rails, or KVM runs under a misguided hypervisor,
spuriously running throught fast path is benign (KVM has been uknowingly
being doing exactly that for years).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:41 -04:00
Sean Christopherson
54275f74cf KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault vian the fast path.  Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.

In other words, don't attempt the fast path just because EPT is enabled.

Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios.  Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.

The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.

On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.

Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:41 -04:00
Li RongQing
91ab933f75 KVM: VMX: clean up pi_wakeup_handler
Passing per_cpu() to list_for_each_entry() causes the macro to be
evaluated N+1 times for N sleeping vCPUs.  This is a very small
inefficiency, and the code is cleaner if the address of the per-CPU
variable is loaded earlier.  Do this for both the list and the spinlock.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:40 -04:00
Maxim Levitsky
33fbe6befa KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness
This shows up as a TDP MMU leak when running nested.  Non-working cmpxchg on L0
relies makes L1 install two different shadow pages under same spte, and one of
them is leaked.

Fixes: 1c2361f667f36 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:26 -04:00
Linus Torvalds
27b5d61c0c A fix and an email address update:
- Prevent FPU state corruption. The condition in irq_fpu_usable() grants
    FPU usage when the FPU is not used in the kernel. That's just wrong as
    it does not take the fpregs_lock()'ed regions into account. If FPU usage
    happens within such a region from interrupt context, then the FPU state
    gets corrupted. That's a long standing bug, which got unearthed by the
    recent changes to the random code.
 
  - Josh wants to use his kernel.org email address
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJ3sb0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRR9EACOcJAkO4ZjHvQf8RDw4ZaC/d0PgEC1
 rEcxL7Tq9qAjdY+VmoRdzAia1FbKWrSNzENiBaTwdM2dxsZN0cl5fEQAy5ffHKXr
 IadRIHICu6INKQ0iuf4VdOt8HuMC+Ams9sFoVDId1avRoejsjIHeCpgBen+0/LQf
 D4i+nvUL9hMcZDsWiQW9mTe8J4fqr7rrg+p7tD0300DbZ6/PFx+zWP58TE8K7vQ8
 dsmfMXxDrJW3d9FOHHvPQXa/Okdm2fHxXuxs3Quc+7HG6cMcwefCYugf8HK3E14F
 q0O6IAOfiYzCL+8aNo4J3H5jPEGLMJ7JlY5Yoygc1mcx0uGyVraMbFOsK8WuRFvP
 eAmx31Wh6EIYOwaboSG+74k/b3hPa6Hx3R7aQDS+SnQQI6I9fdi3ZZtQ+DGnZBZG
 Ipq/f+EjaROh1atUwhE4zM80UKSU6RWEWAlMO4K07uO8a3RnR8qV7N8tl44i+Q7k
 KZUbN5/aV4ccZNwMbazcpZ32fe3SB9cD4e/aLqpMp0uOl9TVxcOA3hIkQ0wflh94
 6XO+gPdvr5VxWayc9tljMXUGPxwjTN4zDKUIlZP2EzYHt6SyZpdwi2+8moEfvU+a
 qcIWPLeXb+972LaY+rTicT4cQxCKe0CZEXCOq1ns+Ni5f5TdKkvyxpeMIOrGtjYG
 /4RqWncPKIyuEw==
 =PpOB
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "A fix and an email address update:

   - Prevent FPU state corruption.

     The condition in irq_fpu_usable() grants FPU usage when the FPU is
     not used in the kernel. That's just wrong as it does not take the
     fpregs_lock()'ed regions into account. If FPU usage happens within
     such a region from interrupt context, then the FPU state gets
     corrupted.

     That's a long standing bug, which got unearthed by the recent
     changes to the random code.

   - Josh wants to use his kernel.org email address"

* tag 'x86-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Prevent FPU state corruption
  MAINTAINERS: Update Josh Poimboeuf's email address
2022-05-08 11:21:54 -07:00
Linus Torvalds
bce58da1f3 x86:
* Account for family 17h event renumberings in AMD PMU emulation
 
 * Remove CPUID leaf 0xA on AMD processors
 
 * Fix lockdep issue with locking all vCPUs
 
 * Fix loss of A/D bits in SPTEs
 
 * Fix syzkaller issue with invalid guest state
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJ1Vf4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNaUQgAgygZ2KsejlJCYGtEkAsjcpdzmPVL
 8j42nWB673/PLZ6GrDXcFnRwQaBIT+0YrES5VHTkTI996d2T/yHII2L4G3DQtUGm
 6L3qYqrjJlX2WjbYGvYzkJ6m4EzcstUfPYNO2Qzfvbl2y/wz64HlAhNdymwMX2UU
 GPUVoo3EHeobJdZVKFMe7eI6r/uY1/uPdsKqNjnlWI73op+tc7mMRN5+SlQDgQvR
 kmzw+Nk0J+PERQO+D+fm1vUdXDQ8hiI7LtTBIUX7rf47IqVlHNHC8frC94PX3W3E
 l2sVS+LzRQRqCgFgQ2ay2gYkl078VL8z4A6vWpcWSmaToEYE7VcAnHqb0Q==
 =6gt2
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86:

   - Account for family 17h event renumberings in AMD PMU emulation

   - Remove CPUID leaf 0xA on AMD processors

   - Fix lockdep issue with locking all vCPUs

   - Fix loss of A/D bits in SPTEs

   - Fix syzkaller issue with invalid guest state"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state
  KVM: SEV: Mark nested locking of vcpu->lock
  kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
  KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
  KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
  KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
  KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
2022-05-06 11:42:58 -07:00
Sean Christopherson
053d2290c0 KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state
Exit to userspace with an emulation error if KVM encounters an injected
exception with invalid guest state, in addition to the existing check of
bailing if there's a pending exception (KVM doesn't support emulating
exceptions except when emulating real mode via vm86).

In theory, KVM should never get to such a situation as KVM is supposed to
exit to userspace before injecting an exception with invalid guest state.
But in practice, userspace can intervene and manually inject an exception
and/or stuff registers to force invalid guest state while a previously
injected exception is awaiting reinjection.

Fixes: fc4fad79fc3d ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220502221850.131873-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06 13:08:06 -04:00
Peter Gonda
0c2c7c0692 KVM: SEV: Mark nested locking of vcpu->lock
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
source and target vcpu->locks. Unfortunately there is an 8 subclass
limit, so a new subclass cannot be used for each vCPU. Instead maintain
ownership of the first vcpu's mutex.dep_map using a role specific
subclass: source vs target. Release the other vcpu's mutex.dep_maps.

Fixes: b56639318bb2b ("KVM: SEV: Add support for SEV intra host migration")
Reported-by: John Sperbeck<jsperbeck@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>

Message-Id: <20220502165807.529624-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06 13:08:04 -04:00
Thomas Gleixner
59f5ede3bc x86/fpu: Prevent FPU state corruption
The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().

But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:

   !in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()

The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interrupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.

Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.

It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.

This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.

Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de
2022-05-05 02:40:19 +02:00
Paolo Bonzini
9913288318 Merge branch 'kvm-amd-pmu-fixes' into HEAD 2022-05-03 08:09:13 -04:00
Paolo Bonzini
04144108a1 Merge branch 'kvm-amd-pmu-fixes' into HEAD 2022-05-03 08:07:54 -04:00
Sandipan Das
5a1bde46f9 kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
On some x86 processors, CPUID leaf 0xA provides information
on Architectural Performance Monitoring features. It
advertises a PMU version which Qemu uses to determine the
availability of additional MSRs to manage the PMCs.

Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
the same, the kernel constructs return values based on the
x86_pmu_capability irrespective of the vendor.

This leaf and the additional MSRs are not supported on AMD
and Hygon processors. If AMD PerfMonV2 is detected, the PMU
version is set to 2 and guest startup breaks because of an
attempt to access a non-existent MSR. Return zeros to avoid
this.

Fixes: a6c06ed1a60a ("KVM: Expose the architectural performance monitoring CPUID leaf")
Reported-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03 08:05:08 -04:00
Kyle Huey
5eb849322d KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
Zen renumbered some of the performance counters that correspond to the
well known events in perf_hw_id. This code in KVM was never updated for
that, so guest that attempt to use counters on Zen that correspond to the
pre-Zen perf_hw_id values will silently receive the wrong values.

This has been observed in the wild with rr[0] when running in Zen 3
guests. rr uses the retired conditional branch counter 00d1 which is
incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.

[0] https://rr-project.org/

Signed-off-by: Kyle Huey <me@kylehuey.com>
Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
Cc: stable@vger.kernel.org
[Check guest family, not host. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03 07:56:53 -04:00
Paolo Bonzini
6ea6581f12 Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD
We are dropping A/D bits (and W bits) in the TDP MMU.  Even if mmu_lock
is held for write, as volatile SPTEs can be written by other tasks/vCPUs
outside of mmu_lock.

Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.

The fix does not seem to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
2022-05-03 07:29:30 -04:00
Paolo Bonzini
4f510c8bb1 Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD
We are dropping A/D bits (and W bits) in the TDP MMU.  Even if mmu_lock
is held for write, as volatile SPTEs can be written by other tasks/vCPUs
outside of mmu_lock.

Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.

The fix does not seem to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
2022-05-03 07:23:08 -04:00
Sean Christopherson
ba3a6120a4 KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.

Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.

Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page.  In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush.  If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.

For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.

The Accessed bit is a complete non-issue.  Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.

Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
on the Dirty bit being problematic in the first place.

Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03 07:22:32 -04:00
Sean Christopherson
54eb3ef56f KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers.  Well, caller, since only one of its two callers
doesn't already do the shadow-present check.

Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change.  Unlike the legacy MMU, the TDP MMU uses a single path for clear
leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(), and calling is_last_spte() without first
calling is_shadow_present_spte() is at best odd, and at worst a violation
of KVM's loosely defines SPTE rules.

Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present.  Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03 07:22:32 -04:00
Sean Christopherson
706c9c55e5 KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
being volatile (unless they're volatile for other reasons, e.g. A/D bits).
KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
cleared just because the SPTE is MMU-writable.

Rename the wrapper of MMU-writable to be more literal, the previous name
of spte_can_locklessly_be_made_writable() is wrong and misleading.

Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03 07:22:31 -04:00
Yuan Yao
c180269d27 KVM: VMX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
The helper function, vcpu_to_pi_desc(), is defined to get the posted
interrupt descriptor from vcpu.  There is one place that doesn't use
it, and instead references vmx_vcpu->pi_desc directly.  Remove the
inconsistency.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <ee7be7832bc424546fd4f05015a844a0205b5ba2.1646422845.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-02 11:42:43 -04:00
Maxim Levitsky
6fcee03df6 KVM: x86: avoid loading a vCPU after .vm_destroy was called
This can cause various unexpected issues, since VM is partially
destroyed at that point.

For example when AVIC is enabled, this causes avic_vcpu_load to
access physical id page entry which is already freed by .vm_destroy.

Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-02 11:42:42 -04:00
Linus Torvalds
b6b2648911 ARM:
* Take care of faults occuring between the PARange and
   IPA range by injecting an exception
 
 * Fix S2 faults taken from a host EL0 in protected mode
 
 * Work around Oops caused by a PMU access from a 32bit
   guest when PMU has been created. This is a temporary
   bodge until we fix it for good.
 
 x86:
 
 * Fix potential races when walking host page table
 
 * Fix shadow page table leak when KVM runs nested
 
 * Work around bug in userspace when KVM synthesizes leaf
   0x80000021 on older (pre-EPYC) or Intel processors
 
 Generic (but affects only RISC-V):
 
 * Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJuxI4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNjfQf/X4Rn6+sTkXRS0UHWEu+q9FjJ+mIx
 ZUWdbncf0brUB1RPAFfKaiQHo0t2Req+iTlpqZL0nVQ4myNUelHYube/sZdK/aBR
 WOjKZE0hugGyMH3js2bsTdgzbcphThyYAX97qGZNb7tsPGhBiw7c98KhjxlieJab
 D8LMNtM3uzPDxg422GfOm8ge2VbpySS5oRoGHfbD+4FiLYlXoCYfZuzlFwFFIGxw
 uHm5zzfX5jshayFpFYVSJHtARXlpwJWKz9yl63QjHrhVitW4m5j4re3aNfboL6Pd
 F5Z9K+DKhJLAH5cqmgiPPe2CGMvmRwKrN3F9MqV91xDPBT8J4rrowEeboQ==
 =SwSU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Take care of faults occuring between the PARange and IPA range by
     injecting an exception

   - Fix S2 faults taken from a host EL0 in protected mode

   - Work around Oops caused by a PMU access from a 32bit guest when PMU
     has been created. This is a temporary bodge until we fix it for
     good.

  x86:

   - Fix potential races when walking host page table

   - Fix shadow page table leak when KVM runs nested

   - Work around bug in userspace when KVM synthesizes leaf 0x80000021
     on older (pre-EPYC) or Intel processors

  Generic (but affects only RISC-V):

   - Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: work around QEMU issue with synthetic CPUID leaves
  Revert "x86/mm: Introduce lookup_address_in_mm()"
  KVM: x86/mmu: fix potential races when walking host page table
  KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
  KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  KVM: arm64: Inject exception on out-of-IPA-range translation fault
  KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set
  KVM: arm64: Handle host stage-2 faults from 32-bit EL0
2022-05-01 11:49:32 -07:00
Linus Torvalds
b2da7df52e - A fix to disable PCI/MSI[-X] masking for XEN_HVM guests as that is
solely controlled by the hypervisor
 
 - A build fix to make the function prototype (__warn()) as visible as
 the definition itself
 
 - A bunch of objtool annotation fixes which have accumulated over time
 
 - An ORC unwinder fix to handle bad input gracefully
 
 - Well, we thought the microcode gets loaded in time in order to restore
 the microcode-emulated MSRs but we thought wrong. So there's a fix for
 that to have the ordering done properly
 
 - Add new Intel model numbers
 
 - A spelling fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJucwMACgkQEsHwGGHe
 VUpgiw/8CuOXJhHSuYscEfAmPGoiG9+oLTYVc1NEfJEIyNuZULcr+aYlddTF79hm
 V+Flq6FyA3NU220F8t5s3jOaDkWjWJ8nZGPUUxo5+yNHugIGYh/kLy6w8LC8SgLq
 GqqYX4fd28tqFSgIBCrr+9GgpTE7bvzBGYLByKj9AO6ecLvWJmc+bENQCTaTRFgl
 og6xenzyECWxgbWIql0UeB1xw2AJ8UfYVeLKzOHpc95ZF209+mg7JLL5yIxwwgNV
 /CGoh28+twjX5SA1rr3cUx9gmFzrYubYZMglhgugBsShkdfuMLhis4woU7lF7cV9
 HnxH6mkvN4R0Im7DZXgQPJ63ZFLJ8tN3RyLQDYBRd71w0Epr/K2aacYeQkWTflcx
 4Ia+AiJ7rpKx0cUbUHX7pf3lzna/c8u/xPnlAIbR6rfwXO5mACupaofN5atAdx9T
 9rPCPIdroM5XzBTiN4aNJHEsADL1h/oQdzrziTwryyezbTtnNC5KW53hnqyf5Bqo
 gBlbfVsnwM0AfLHSPE1D0liOR2spwuB+/bWrsOCzEYENC44nDxHE/MUUjg7/l+Vr
 6N5syrQ7QsIPqUaEM+bQdKHGaXSU6amF8OWpFMjzkleQw5m7/X8LzyZsBlB4yeqv
 63hUEpdmFyR/6bLdEvjUXeAPcbA41WHwOMdNPaKDqn3zhwYZaa4=
 =poyP
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - A fix to disable PCI/MSI[-X] masking for XEN_HVM guests as that is
   solely controlled by the hypervisor

 - A build fix to make the function prototype (__warn()) as visible as
   the definition itself

 - A bunch of objtool annotation fixes which have accumulated over time

 - An ORC unwinder fix to handle bad input gracefully

 - Well, we thought the microcode gets loaded in time in order to
   restore the microcode-emulated MSRs but we thought wrong. So there's
   a fix for that to have the ordering done properly

 - Add new Intel model numbers

 - A spelling fix

* tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
  bug: Have __warn() prototype defined unconditionally
  x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config
  objtool: Use offstr() to print address of missing ENDBR
  objtool: Print data address for "!ENDBR" data warnings
  x86/xen: Add ANNOTATE_NOENDBR to startup_xen()
  x86/uaccess: Add ENDBR to __put_user_nocheck*()
  x86/retpoline: Add ANNOTATE_NOENDBR for retpolines
  x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline
  objtool: Enable unreachable warnings for CLANG LTO
  x86,objtool: Explicitly mark idtentry_body()s tail REACHABLE
  x86,objtool: Mark cpu_startup_entry() __noreturn
  x86,xen,objtool: Add UNWIND hint
  lib/strn*,objtool: Enforce user_access_begin() rules
  MAINTAINERS: Add x86 unwinding entry
  x86/unwind/orc: Recheck address range after stack info was updated
  x86/cpu: Load microcode during restore_processor_state()
  x86/cpu: Add new Alderlake and Raptorlake CPU model numbers
2022-05-01 10:03:36 -07:00
Linus Torvalds
b70ed23c23 - A bunch of objtool fixes to improve unwinding, sibling call detection,
fallthrough detection and relocation handling of weak symbols when the
 toolchain strips section symbols
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJuckgACgkQEsHwGGHe
 VUrnTw//TQ1gcAYX4vNibZvOYLRS090uvrnfrosCLBTlOLuPTnB71hTTCxaV6wPV
 lXbW5n795G9XmQAkKyqRjNA2PHGKP+D187ooFwJjHW661+dQgdo4EhbRtR4s/IMW
 Vd3ZRL0bmCImPKz4MrSVPEL0UotMHI2XYwr6Wf/kOmJ6nlTgmnVE3dI4sOkXQCtJ
 ZMCtSm6XN4LTnYLgkP99AuPQe4tC2Fw/zXkFZWkm3Ku6xvEtyfSLLByli8Tqf4p9
 mcVoLfBnvYc6ift/tBg9tGFTdw8BzQdmhvnwgMnouiA7bjuhEZ+ef7+LwEpg/5J6
 tMNIeO9m8DzR1jZm2vuu+VHB+GwYonXhElJY8JbpGfvI/zjYhxHNdyx3Nn9Cpd7B
 whxu7dRodUmI78/Ab3ywA+rDbMQw9ljT4254JhA/VeHxWuKodWU5PKRcS9nYSR+p
 NNSSxWmzy4+3h4d9Twd35CWa7ALroepr4JjyEs54Xar7kmoZhiFg8/P0cD2u5ZtL
 aBuDDOw8sQOzFHY8sQpYr4k4sI7VdA8fOBXJ0bllu962Gg1aujfuHlCP/ToRpJGc
 2YXXUI0tWmOsn5pGI5ludAQ5B+M0j1JxrowEb+gPfuqk7hoN53c4fery4JjtrsJ5
 0DPsSKq9SVY+SSLNTuTchQUBZcWAY3GXZYBHr8KuV+iY1zL/rCg=
 =7nEx
 -----END PGP SIGNATURE-----

Merge tag 'objtool_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fixes from Borislav Petkov:
 "A bunch of objtool fixes to improve unwinding, sibling call detection,
  fallthrough detection and relocation handling of weak symbols when the
  toolchain strips section symbols"

* tag 'objtool_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix code relocs vs weak symbols
  objtool: Fix type of reloc::addend
  objtool: Fix function fallthrough detection for vmlinux
  objtool: Fix sibling call detection in alternatives
  objtool: Don't set 'jump_dest' for sibling calls
  x86/uaccess: Don't jump between functions
2022-05-01 09:34:54 -07:00
Paolo Bonzini
f751d8eac1 KVM: x86: work around QEMU issue with synthetic CPUID leaves
Synthesizing AMD leaves up to 0x80000021 caused problems with QEMU,
which assumes the *host* CPUID[0x80000000].EAX is higher or equal
to what KVM_GET_SUPPORTED_CPUID reports.

This causes QEMU to issue bogus host CPUIDs when preparing the input
to KVM_SET_CPUID2.  It can even get into an infinite loop, which is
only terminated by an abort():

   cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e)

To work around this, only synthesize those leaves if 0x8000001d exists
on the host.  The synthetic 0x80000021 leaf is mostly useful on Zen2,
which satisfies the condition.

Fixes: f144c49e8c39 ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 15:24:58 -04:00