40166 Commits

Author SHA1 Message Date
Paolo Bonzini
006a0f0607 KVM: x86: avoid out of bounds indices for fixed performance counters
Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.

Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway.  Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.
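As an illustration of the culling idea only (a toy sketch with hypothetical names, not the actual patch), the loop can simply skip indices beyond what KVM emulates before they ever reach fixed_pmc_events:

  #define KVM_SUPPORTED_FIXED_PMC 3   /* size of KVM's fixed_pmc_events[] */

  /* hypothetical stand-in for reprogramming one fixed counter */
  static void reprogram_one_fixed_counter(int idx, u8 ctrl) { }

  static void reprogram_fixed_counters_sketch(u64 fixed_ctr_ctrl, int hw_counters)
  {
          int idx;

          for (idx = 0; idx < hw_counters; idx++) {
                  /* cull counters KVM does not support, e.g. index 3 on Ice Lake */
                  if (idx >= KVM_SUPPORTED_FIXED_PMC)
                          continue;
                  reprogram_one_fixed_counter(idx, (fixed_ctr_ctrl >> (idx * 4)) & 0xf);
          }
  }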

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:40 -05:00
Lai Jiangshan
5b61178cd2 KVM: VMX: Mark VCPU_EXREG_CR3 dirty when !CR0_PG -> CR0_PG if EPT + !URG
When !CR0_PG -> CR0_PG, vcpu->arch.cr3 becomes active, but GUEST_CR3 is
still vmx->ept_identity_map_addr if EPT + !URG.  So VCPU_EXREG_CR3 is
considered to be dirty and GUEST_CR3 needs to be updated in this case.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-4-jiangshanlai@gmail.com>
Fixes: c62c7bd4f95b ("KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:40 -05:00
Lai Jiangshan
6b123c3a89 KVM: x86/mmu: Reconstruct shadow page root if the guest PDPTEs is changed
For shadow paging, the page tables need to be reconstructed before the
next VMENTER if the guest PDPTEs are changed.

But not all paths that call load_pdptrs() cause the page tables to be
reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
are used to trigger the later reconstruction.

The commit d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.

The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
load_pdptrs() when rewriting the CR3 with the same value.

The commit a91a7c709600 ("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.

Guests like Linux keep the PDPTEs unchanged for every instance of a
page table, so the missing reconstruction is not a problem for Linux
guests.
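A minimal sketch of the idea behind the fix (simplified, not the exact diff): load_pdptrs() drops the current root when the cached PDPTEs actually change, so the shadow root is reconstructed before the next VMENTER.

  /* sketch, inside load_pdptrs() after the new PDPTEs have been read */
  if (memcmp(mmu->pdptrs, pdptrs, sizeof(mmu->pdptrs))) {
          /* PDPTEs changed: free the current shadow root so it is rebuilt */
          kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
          memcpy(mmu->pdptrs, pdptrs, sizeof(mmu->pdptrs));
  }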

Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:40 -05:00
Lai Jiangshan
a9f2705ec8 KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()
The host CR3 in the vcpu thread can only be changed when scheduling,
so commit 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
changed vmx.c to only save it in vmx_prepare_switch_to_guest().

However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
vmx_set_host_fs_gs() is called in both places, so rename it to
vmx_set_vmcs_host_state() and make it update HOST_CR3.
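Roughly, the renamed helper has the shape sketched below (approximate, not the exact upstream signature). Both vmx_prepare_switch_to_guest() and vmx_sync_vmcs_host_state() call it, so HOST_CR3 is refreshed wherever FS/GS are.

  static void vmx_set_vmcs_host_state(struct vmcs_host_state *host, unsigned long cr3,
                                      u16 fs_sel, u16 gs_sel,
                                      unsigned long fs_base, unsigned long gs_base)
  {
          if (unlikely(cr3 != host->cr3)) {
                  vmcs_writel(HOST_CR3, cr3);
                  host->cr3 = cr3;
          }
          /* ... the existing FS/GS selector and base updates follow unchanged ... */
  }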

Fixes: 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:39 -05:00
Paolo Bonzini
46cbc0400f Revert "KVM: X86: Update mmu->pdptrs only when it is changed"
This reverts commit 24cd19a28cb7174df502162641d6e1e12e7ffbd9.
Sean Christopherson reports:

"Commit 24cd19a28cb7 ('KVM: X86: Update mmu->pdptrs only when it is
changed') breaks nested VMs with EPT in L0 and PAE shadow paging in L2.
Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
investigated how it breaks things."

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:39 -05:00
Linus Torvalds
1286cc4893 - Use the proper CONFIG symbol in a preprocessor check.

Merge tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:

 - Use the proper CONFIG symbol in a preprocessor check.

* tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Use the proper name CONFIG_FW_LOADER
2022-01-02 09:02:54 -08:00
Lukas Bulwahn
d6f12f8398 x86/build: Use the proper name CONFIG_FW_LOADER
The commit in Fixes intends to add the regex expression only when FW_LOADER
is enabled - not FW_LOADER_BUILTIN. The latter is a leftover from a previous
patchset and not a valid config item.

So, adjust the condition to the actual name of the config.

  [ bp: Cleanup commit message. ]

Fixes: c8dcf655ec81 ("x86/build: Tuck away built-in firmware under FW_LOADER")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20211229111553.5846-1-lukas.bulwahn@gmail.com
2021-12-29 22:20:38 +01:00
Linus Torvalds
a8ad9a2434 Another EFI fix for v5.16
- Prevent missing prototype warning from breaking the build under
   CONFIG_WERROR=y

Merge tag 'efi-urgent-for-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fix from Ard Biesheuvel:
 "Another EFI fix for v5.16:

   - Prevent missing prototype warning from breaking the build under
     CONFIG_WERROR=y"

* tag 'efi-urgent-for-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Move efifb_setup_from_dmi() prototype from arch headers
2021-12-27 08:58:35 -08:00
Linus Torvalds
e8ffcd3ab0 - Prevent potential undefined behavior due to shifting pkey constants into the sign bit
- Move the EFI memory reservation code *after* the efi= cmdline parsing has happened
 
 - Revert two commits which turned out to be the wrong direction to chase
 when accommodating early memblock reservations consolidation and command line parameters
 parsing

Merge tag 'x86_urgent_for_v5.16_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Prevent potential undefined behavior due to shifting pkey constants
   into the sign bit

 - Move the EFI memory reservation code *after* the efi= cmdline parsing
   has happened

 - Revert two commits which turned out to be the wrong direction to
   chase when accommodating early memblock reservations consolidation
   and command line parameters parsing

* tag 'x86_urgent_for_v5.16_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pkey: Fix undefined behaviour with PKRU_WD_BIT
  x86/boot: Move EFI range reservation after cmdline parsing
  Revert "x86/boot: Pull up cmdline preparation and early param parsing"
  Revert "x86/boot: Mark prepare_command_line() __init"
2021-12-26 10:28:55 -08:00
Linus Torvalds
ca0ea8a60b * Fix for compilation of selftests on non-x86 architectures
* Fix for kvm_run->if_flag on SEV-ES
 * Fix for page table use-after-free if yielding during exit_mm()
 * Improve behavior when userspace starts a nested guest with invalid state
 * Fix missed wakeup with assigned devices but no VT-d posted interrupts
 * Do not tell userspace to save/restore an unsupported PMU MSR

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Fix for compilation of selftests on non-x86 architectures

 - Fix for kvm_run->if_flag on SEV-ES

 - Fix for page table use-after-free if yielding during exit_mm()

 - Improve behavior when userspace starts a nested guest with invalid
   state

 - Fix missed wakeup with assigned devices but no VT-d posted interrupts

 - Do not tell userspace to save/restore an unsupported PMU MSR

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
  KVM: selftests: Add test to verify TRIPLE_FAULT on invalid L2 guest state
  KVM: VMX: Fix stale docs for kvm-intel.emulate_invalid_guest_state
  KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required
  KVM: VMX: Always clear vmx->fail on emulation_required
  selftests: KVM: Fix non-x86 compiling
  KVM: x86: Always set kvm_run->if_flag
  KVM: x86/mmu: Don't advance iterator after restart due to yielding
  KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all
2021-12-21 12:25:57 -08:00
Paolo Bonzini
855fb0384a Merge remote-tracking branch 'kvm/master' into HEAD
Pick commit fdba608f15e2 ("KVM: VMX: Wake vCPU when delivering posted
IRQ even if vCPU == this vCPU").  In addition to fixing a bug, it
also aligns the non-nested and nested usage of triggering posted
interrupts, allowing for additional cleanups.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-21 12:51:09 -05:00
Sean Christopherson
fdba608f15 KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
If a vIRQ is delivered from asynchronous context, the target vCPU can be
the currently running vCPU and can also be blocking, in which case
skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
be a wake event for the vCPU.

The "do nothing" logic when "vcpu == running_vcpu" mostly works only
because the majority of calls to ->deliver_posted_interrupt(), especially
when using posted interrupts, come from synchronous KVM context.  But if
a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
use posted interrupts, wake events from the device will be delivered to
KVM from IRQ context, e.g.

  vfio_msihandler()
  |
  |-> eventfd_signal()
      |
      |-> ...
          |
          |->  irqfd_wakeup()
               |
               |->kvm_arch_set_irq_inatomic()
                  |
                  |-> kvm_irq_delivery_to_apic_fast()
                      |
                      |-> kvm_apic_set_irq()

This also aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.
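In simplified before/after form (a sketch of the delivery path, not the full diff):

  /* before: the wakeup was skipped when the target was the running vCPU */
  if (vcpu != kvm_get_running_vcpu() &&
      !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
          kvm_vcpu_wake_up(vcpu);

  /* after: always try to post the IRQ and fall back to a wake event */
  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
          kvm_vcpu_wake_up(vcpu);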

Fixes: 379a3c8ee444 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-21 12:39:03 -05:00
Sean Christopherson
cd0e615c49 KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required
Synthesize a triple fault if L2 guest state is invalid at the time of
VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
invalid guest state, since from L1's perspective, it's architecturally
impossible for L2 to have invalid state while L2 is running in hardware.
E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
or #GP.

Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
can trigger this scenario, as nested VM-Enter correctly rejects any
attempt to enter L2 with invalid state.

RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
Following AMD's SMRAM layout is important as AMD's layout saves/restores
the descriptor cache information, including CS.RPL and SS.RPL, and also
defines all the fields relevant to invalid guest state as read-only, i.e.
so long as the vCPU had valid state before the SMI, which is guaranteed
for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
layout saves/restores only the selector, which means that scenarios where
the selector and cached RPL don't match, e.g. conforming code segments,
would yield invalid guest state.  Intel CPUs fudge around this issue by
stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
Treatment of RSM", paraphrasing for brevity:

  IF internal storage indicates that the [CPU was post-VMXON]
  THEN
     enter VMX operation (root or non-root);
     restore VMX-critical state as defined in Section 34.14.1;
     set to their fixed values any bits in CR0 and CR4 whose values must
     be fixed in VMX operation [unless coming from an unrestricted guest];
     IF RFLAGS.VM = 0 AND (in VMX root operation OR the
        “unrestricted guest” VM-execution control is 0)
     THEN
       CS.RPL := SS.DPL;
       SS.RPL := SS.DPL;
     FI;
     restore current VMCS pointer;
  FI;

Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
will synthesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
only way for CR0 and/or CR4 to have illegal values is if they were
modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
section states "modifying these registers will result in unpredictable
behavior".

KVM's ioctl() behavior is less straightforward.  Because KVM allows
ioctls() to be executed in any order, rejecting an ioctl() if it would
result in invalid L2 guest state is not an option as KVM cannot know if
a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
reject KVM_RUN if L2 contained invalid guest state, but that carries the
risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
exited to userspace.  Setting a flag/request to detect such a scenario is
undesirable because (a) it's extremely unlikely to add value to KVM as a
whole, and (b) KVM would need to consider ioctl() interactions with such
a flag, e.g. if userspace migrated the vCPU while the flag were set.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-3-seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:54 -05:00
Sean Christopherson
a80dfc0259 KVM: VMX: Always clear vmx->fail on emulation_required
Revert a relatively recent change that set vmx->fail if the vCPU is in L2
and emulation_required is true, as that behavior is completely bogus.
Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:

  (a) it's impossible to have both a VM-Fail and VM-Exit
  (b) vmcs.EXIT_REASON is not modified on VM-Fail
  (c) emulation_required refers to guest state and guest state checks are
      always VM-Exits, not VM-Fails.

For KVM specifically, emulation_required is handled before nested exits
in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
firing when tearing down the VM, as KVM never expects vmx->fail to be set
when L2 is active; KVM always reflects those errors into L1.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                nested_vmx_vmexit+0x16bd/0x17e0
                                arch/x86/kvm/vmx/nested.c:4547
  Modules linked in:
  CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
  Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
  Call Trace:
   vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
   nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
   vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
   kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
   kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
   kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
   kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
   kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
   kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
   kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
   __fput+0x3fc/0x870 fs/file_table.c:280
   task_work_run+0x146/0x1c0 kernel/task_work.c:164
   exit_task_work include/linux/task_work.h:32 [inline]
   do_exit+0x705/0x24f0 kernel/exit.c:832
   do_group_exit+0x168/0x2d0 kernel/exit.c:929
   get_signal+0x1740/0x2120 kernel/signal.c:2852
   arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
   handle_signal_work kernel/entry/common.c:148 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
   exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
   __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
   syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
   do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: c8607e4a086f ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:54 -05:00
Marc Orr
c5063551bf KVM: x86: Always set kvm_run->if_flag
The kvm_run struct's if_flag is a part of the userspace/kernel API. The
SEV-ES patches failed to set this flag because it's no longer needed by
QEMU (according to the comment in the source code). However, other
hypervisors may make use of this flag. Therefore, set the flag for
guests with encrypted registers (i.e., with guest_state_protected set).
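A hedged sketch of the approach (names approximate, not the exact diff): report IF through a vendor callback so that SEV-ES guests, whose RFLAGS are encrypted, still populate kvm_run->if_flag from the VMCB's interrupt state.

  static bool svm_get_if_flag(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          /* encrypted registers: use the interrupt state hardware reports */
          if (vcpu->arch.guest_state_protected)
                  return !!(svm->vmcb->control.int_state & SVM_GUEST_INTERRUPT_MASK);

          return !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF);
  }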

Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Signed-off-by: Marc Orr <marcorr@google.com>
Message-Id: <20211209155257.128747-1-marcorr@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
2021-12-20 08:06:53 -05:00
Sean Christopherson
3a0f64de47 KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator.  Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.

When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.

Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probability of improper usage.
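For illustration, the guard rails look roughly like this (approximate, simplified):

  static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
                                                            struct tdp_iter *iter,
                                                            bool flush, bool shared)
  {
          /* callers must not keep using the iterator after a yield */
          WARN_ON(iter->yielded);

          /* ... drop mmu_lock, cond_resched(), mark iter->yielded ... */
          return iter->yielded;
  }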

Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.

  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list.  This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.

  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry.  But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.

Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label.  The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enforce that requirement.

Ideally, KVM would handle the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult to do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:53 -05:00
Wei Wang
9fb12fe5b9 KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all
Fixed counter 3 is used for the Topdown metrics, which haven't been
enabled for KVM guests. Userspace accesses to it will fail as it's not
included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
which have this counter.

To reproduce it on ICX+ machines, ./state_test reports:
==== Test Assertion Failure ====
lib/x86_64/processor.c:1078: r == nmsrs
pid=4564 tid=4564 - Argument list too long
1  0x000000000040b1b9: vcpu_save_state at processor.c:1077
2  0x0000000000402478: main at state_test.c:209 (discriminator 6)
3  0x00007fbe21ed5f92: ?? ??:0
4  0x000000000040264d: _start at ??:?
 Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)

With this patch, it works well.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 10:51:19 +01:00
Andrew Cooper
57690554ab x86/pkey: Fix undefined behaviour with PKRU_WD_BIT
Both __pkru_allows_write() and arch_set_user_pkey_access() shift
PKRU_WD_BIT (a signed constant) by up to 30 bits, hitting the
sign bit.

Use unsigned constants instead.

Clearly pkey 15 has not been used in combination with UBSAN yet.

Noticed by code inspection only.  I can't actually provoke the
compiler into generating incorrect logic as far as this shift is
concerned.
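A small self-contained sketch of the bug class (illustrative values): with 2 PKRU bits per pkey, pkey 15 shifts by 30, so a signed 0x2 constant reaches bit 31; making the constants unsigned avoids the undefined behaviour.

  #define PKRU_AD_BIT 0x1u   /* was 0x1 (signed) */
  #define PKRU_WD_BIT 0x2u   /* was 0x2 (signed) */

  static int pkru_write_disabled(unsigned int pkru, int pkey)
  {
          int pkey_shift = pkey * 2;   /* PKRU_BITS_PER_PKEY */

          /* for pkey 15, PKRU_WD_BIT << 30 sets bit 31: fine for 0x2u, UB for 0x2 */
          return !!(pkru & (PKRU_WD_BIT << pkey_shift));
  }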

[
  dhansen: add stable@ tag, plus minor changelog massaging,

           For anyone doing backports, these #defines were in
	   arch/x86/include/asm/pgtable.h before 784a46618f6.
]

Fixes: 33a709b25a76 ("mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211216000856.4480-1-andrew.cooper3@citrix.com
2021-12-19 22:44:34 +01:00
Linus Torvalds
f291e2d899 Two small fixes, one of which was being worked around in selftests.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Two small fixes, one of which was being worked around in selftests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Retry page fault if MMU reload is pending and root has no sp
  KVM: selftests: vmx_pmu_msrs_test: Drop tests mangling guest visible CPUIDs
  KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES
2021-12-19 12:44:03 -08:00
Sean Christopherson
18c841e1f4 KVM: x86: Retry page fault if MMU reload is pending and root has no sp
Play nice with a NULL shadow page when checking for an obsolete root in
the page fault handler by flagging the page fault as stale if there's no
shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
Invalidating memslots, which is the only case where _all_ roots need to
be reloaded, requests all vCPUs to reload their MMUs while holding
mmu_lock for write.

The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
backed by a shadow page.  Running with TDP disabled or with nested NPT
explodes spectacularly due to dereferencing a NULL shadow page pointer.

Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
root.  Zapping shadow pages in response to guest activity, e.g. when the
guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
vCPU isn't using the affected root.  I.e. KVM_REQ_MMU_RELOAD can be seen
with a completely valid root shadow page.  This is a bit of a moot point
as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
be cleaned up in the future.
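A simplified sketch of the staleness check (approximate, not the exact upstream code):

  static bool is_page_fault_stale_sketch(struct kvm_vcpu *vcpu)
  {
          struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);

          if (sp)
                  return is_obsolete_sp(vcpu->kvm, sp);

          /*
           * "Special" roots, e.g. pae_root, have no shadow page; retry the
           * fault if all roots are about to be reloaded anyway.
           */
          return kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu);
  }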

Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211209060552.2956723-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-19 19:38:58 +01:00
Vitaly Kuznetsov
1aa2abb33a KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES
The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should
not depend on guest visible CPUID entries, even if just to allow
creating/restoring guest MSRs and CPUIDs in any sequence.
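The resulting pattern, as a hedged sketch (hypothetical helper, not the exact diff): the guest CPUID check only applies to guest-initiated writes.

  static int perf_caps_write_denied(struct kvm_vcpu *vcpu, bool host_initiated)
  {
          /* host-initiated writes (e.g. restoring state) skip the CPUID check */
          if (host_initiated)
                  return 0;

          return guest_cpuid_has(vcpu, X86_FEATURE_PDCM) ? 0 : 1;
  }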

Fixes: 27461da31089 ("KVM: x86/pmu: Support full width counting")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211216165213.338923-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-19 19:35:18 +01:00
Linus Torvalds
180f3bcfe3 Networking fixes for 5.16-rc6, including fixes from mac80211, wifi, bpf.
Current release - regressions:
 
  - dpaa2-eth: fix buffer overrun when reporting ethtool statistics
 
 Current release - new code bugs:
 
  - bpf: fix incorrect state pruning for <8B spill/fill
 
  - iavf:
      - add missing unlocks in iavf_watchdog_task()
      - do not override the adapter state in the watchdog task (again)
 
  - mlxsw: spectrum_router: consolidate MAC profiles when possible
 
 Previous releases - regressions:
 
  - mac80211, fix:
      - rate control, avoid driver crash for retransmitted frames
      - regression in SSN handling of addba tx
      - a memory leak where sta_info is not freed
      - marking TX-during-stop for TX in in_reconfig, prevent stall
 
  - cfg80211: acquire wiphy mutex on regulatory work
 
  - wifi drivers: fix build regressions and LED config dependency
 
  - virtio_net: fix rx_drops stat for small pkts
 
  - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
 
 Previous releases - always broken:
 
  - bpf, fix:
     - kernel address leakage in atomic fetch
     - kernel address leakage in atomic cmpxchg's r0 aux reg
     - signed bounds propagation after mov32
     - extable fixup offset
     - extable address check
 
  - mac80211:
      - fix the size used for building probe request
      - send ADDBA requests using the tid/queue of the aggregation
        session
      - agg-tx: don't schedule_and_wake_txq() under sta->lock,
        avoid deadlocks
      - validate extended element ID is present
 
  - mptcp:
      - never allow the PM to close a listener subflow (null-defer)
      - clear 'kern' flag from fallback sockets, prevent crash
      - fix deadlock in __mptcp_push_pending()
 
  - inet_diag: fix kernel-infoleak for UDP sockets
 
  - xsk: do not sleep in poll() when need_wakeup set
 
  - smc: avoid very long waits in smc_release()
 
  - sch_ets: don't remove idle classes from the round-robin list
 
  - netdevsim:
      - zero-initialize memory for bpf map's value, prevent info leak
      - don't let user space overwrite read only (max) ethtool parms
 
  - ixgbe: set X550 MDIO speed before talking to PHY
 
  - stmmac:
      - fix null-deref in flower deletion w/ VLAN prio Rx steering
      - dwmac-rk: fix oob read in rk_gmac_setup
 
  - ice: time stamping fixes
 
  - systemport: add global locking for descriptor life cycle
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from mac80211, wifi, bpf.

  Relatively large batches of fixes from BPF and the WiFi stack, calm in
  general networking.

  Current release - regressions:

   - dpaa2-eth: fix buffer overrun when reporting ethtool statistics

  Current release - new code bugs:

   - bpf: fix incorrect state pruning for <8B spill/fill

   - iavf:
       - add missing unlocks in iavf_watchdog_task()
       - do not override the adapter state in the watchdog task (again)

   - mlxsw: spectrum_router: consolidate MAC profiles when possible

  Previous releases - regressions:

   - mac80211 fixes:
       - rate control, avoid driver crash for retransmitted frames
       - regression in SSN handling of addba tx
       - a memory leak where sta_info is not freed
       - marking TX-during-stop for TX in in_reconfig, prevent stall

   - cfg80211: acquire wiphy mutex on regulatory work

   - wifi drivers: fix build regressions and LED config dependency

   - virtio_net: fix rx_drops stat for small pkts

   - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()

  Previous releases - always broken:

   - bpf fixes:
       - kernel address leakage in atomic fetch
       - kernel address leakage in atomic cmpxchg's r0 aux reg
       - signed bounds propagation after mov32
       - extable fixup offset
       - extable address check

   - mac80211:
       - fix the size used for building probe request
       - send ADDBA requests using the tid/queue of the aggregation
         session
       - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid
         deadlocks
       - validate extended element ID is present

   - mptcp:
       - never allow the PM to close a listener subflow (null-defer)
       - clear 'kern' flag from fallback sockets, prevent crash
       - fix deadlock in __mptcp_push_pending()

   - inet_diag: fix kernel-infoleak for UDP sockets

   - xsk: do not sleep in poll() when need_wakeup set

   - smc: avoid very long waits in smc_release()

   - sch_ets: don't remove idle classes from the round-robin list

   - netdevsim:
       - zero-initialize memory for bpf map's value, prevent info leak
       - don't let user space overwrite read only (max) ethtool parms

   - ixgbe: set X550 MDIO speed before talking to PHY

   - stmmac:
       - fix null-deref in flower deletion w/ VLAN prio Rx steering
       - dwmac-rk: fix oob read in rk_gmac_setup

   - ice: time stamping fixes

   - systemport: add global locking for descriptor life cycle"

* tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits)
  bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
  selftest/bpf: Add a test that reads various addresses.
  bpf: Fix extable address check.
  bpf: Fix extable fixup offset.
  bpf, selftests: Add test case trying to taint map value pointer
  bpf: Make 32->64 bounds propagation slightly more robust
  bpf: Fix signed bounds propagation after mov32
  sit: do not call ipip6_dev_free() from sit_init_net()
  net: systemport: Add global locking for descriptor lifecycle
  net/smc: Prevent smc_release() from long blocking
  net: Fix double 0x prefix print in SKB dump
  virtio_net: fix rx_drops stat for small pkts
  dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED
  sfc_ef100: potential dereference of null pointer
  net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup
  net: usb: lan78xx: add Allied Telesis AT29M2-AF
  net/packet: rx_owner_map depends on pg_vec
  netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc
  dpaa2-eth: fix ethtool statistics
  ixgbe: set X550 MDIO speed before talking to PHY
  ...
2021-12-16 15:02:14 -08:00
Alexei Starovoitov
588a25e924 bpf: Fix extable address check.
The verifier checks that a PTR_TO_BTF_ID pointer is either valid or NULL,
but it cannot distinguish an IS_ERR pointer from a valid one.

When an offset is added to an IS_ERR pointer, it may become a small positive
value, i.e. a user address that is not handled by the extable logic and has
to be checked for at runtime.

Tighten BPF_PROBE_MEM pointer check code to prevent this case.

Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.")
Reported-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16 21:41:04 +01:00
Alexei Starovoitov
433956e912 bpf: Fix extable fixup offset.
The expression prog - start_of_ldx is the offset from just before the faulting
ldx to the location after it; it is used to adjust pt_regs->ip so execution
jumps over the ldx and continues. With the old temp it would have been fixed
up to the wrong offset, causing a crash.

Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16 21:18:26 +01:00
Mike Rapoport
2f5b3514c3 x86/boot: Move EFI range reservation after cmdline parsing
The memory reservation in arch/x86/platform/efi/efi.c depends on at
least two command line parameters. Put it back later in the boot process
and move efi_memblock_x86_reserve_range() out of early_reserve_memory().

An attempt to fix this was done in

  8d48bf8206f7 ("x86/boot: Pull up cmdline preparation and early param parsing")

but that caused other troubles so it got reverted.

The bug this is addressing is:

Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve
kernel command line parameter to suppress "soft reservation" behavior.

This is due to the fact that the following call-chain happens at boot:

  early_reserve_memory
  |-> efi_memblock_x86_reserve_range
      |-> efi_fake_memmap_early

which does

        if (!efi_soft_reserve_enabled())
                return;

and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed
"nosoftreserve".

However, parse_early_param() gets called *after* it, leading to the boot
cmdline not being taken into account.

See also https://lore.kernel.org/r/e8dd8993c38702ee6dd73b3c11f158617e665607.camel@intel.com

  [ bp: Turn into a proper patch. ]

Signed-off-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-4-bp@alien8.de
2021-12-15 14:07:54 +01:00
Borislav Petkov
fbe6183998 Revert "x86/boot: Pull up cmdline preparation and early param parsing"
This reverts commit 8d48bf8206f77aa8687f0e241e901e5197e52423.

It turned out to be a bad idea because it broke supplying mem= cmdline
parameters: parse_memopt() requires preparatory work, such as setting up
the e820 table in e820__memory_setup(), in order to be able to exclude
the range specified by mem=.

Pulling that up would've broken Xen PV again, see threads at

  https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com

due to xen_memory_setup() needing the first reservations in
early_reserve_memory() - kernel and initrd - to have happened already.

This could be fixed again by having Xen do those reservations itself...

Long story short, revert this and do a simpler fix in a later patch.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-3-bp@alien8.de
2021-12-15 11:38:57 +01:00
Borislav Petkov
58e138d624 Revert "x86/boot: Mark prepare_command_line() __init"
This reverts commit c0f2077baa4113f38f008b8e912b9fb3ff8d43df.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-2-bp@alien8.de
2021-12-15 11:14:28 +01:00
Javier Martinez Canillas
4bc5e64e6c efi: Move efifb_setup_from_dmi() prototype from arch headers
Commit 8633ef82f101 ("drivers/firmware: consolidate EFI framebuffer setup
for all arches") made the Generic System Framebuffers (sysfb) driver able
to be built on non-x86 architectures.

But it left the efifb_setup_from_dmi() function prototype declaration in
the architecture specific headers. This could lead to the following
compiler warning as reported by the kernel test robot:

   drivers/firmware/efi/sysfb_efi.c:70:6: warning: no previous prototype for function 'efifb_setup_from_dmi' [-Wmissing-prototypes]
   void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
        ^
   drivers/firmware/efi/sysfb_efi.c:70:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void efifb_setup_from_dmi(struct screen_info *si, const char *opt)

Fixes: 8633ef82f101 ("drivers/firmware: consolidate EFI framebuffer setup for all arches")
Reported-by: kernel test robot <lkp@intel.com>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20211126001333.555514-1-javierm@redhat.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2021-12-13 15:07:16 +01:00
Linus Torvalds
773602256a A single fix for the x86 scheduler topology:
Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the
   scheduler towards the ATOM cluster as that has more total capacity.
   Use selection based on CPU priority instead.

Merge tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single fix for the x86 scheduler topology:

  Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the
  scheduler towards the ATOM cluster as that has more total capacity.
  Use selection based on CPU priority instead"

* tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched,x86: Don't use cluster topology for x86 hybrid CPUs
2021-12-12 09:38:04 -08:00
Linus Torvalds
b9172f9e88 More x86 fixes:
* Logic bugs in CR0 writes and Hyper-V hypercalls
 * Don't use Enlightened MSR Bitmap for L3
 * Remove user-triggerable WARN
 
 Plus a few selftest fixes and a regression test for the
 user-triggerable WARN.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "More x86 fixes:

   - Logic bugs in CR0 writes and Hyper-V hypercalls

   - Don't use Enlightened MSR Bitmap for L3

   - Remove user-triggerable WARN

  Plus a few selftest fixes and a regression test for the
  user-triggerable WARN"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: KVM: Add test to verify KVM doesn't explode on "bad" I/O
  KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit
  KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode
  selftests: KVM: avoid failures due to reserved HyperTransport region
  KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req
  KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall
  KVM: x86: selftests: svm_int_ctl_test: fix intercept calculation
  KVM: nVMX: Don't use Enlightened MSR Bitmap for L3
2021-12-10 14:09:12 -08:00
Sean Christopherson
d07898eaf3 KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit
Replace a WARN with a comment to call out that userspace can modify RCX
during an exit to userspace to handle string I/O.  KVM doesn't actually
support changing the rep count during an exit, i.e. the scenario can be
ignored, but the WARN needs to go as it's trivial to trigger from
userspace.

Cc: stable@vger.kernel.org
Fixes: 3b27de271839 ("KVM: x86: split the two parts of emulator_pio_in")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211025201311.1881846-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 09:38:02 -05:00
Lai Jiangshan
777ab82d7c KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode
In the SDM:
If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an
attempt to clear CR0.PG causes a general-protection exception (#GP).
Software should transition to compatibility mode and clear CR4.PCIDE
before attempting to disable paging.
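The added check in kvm_set_cr0() is roughly the following (sketch, simplified):

  if (!(cr0 & X86_CR0_PG) &&
      (is_64_bit_mode(vcpu) || kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE)))
          return 1;       /* the caller injects #GP */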

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 09:38:01 -05:00
Sean Christopherson
3244867af8 KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req
Do not bail early if there are no bits set in the sparse banks for a
non-sparse, a.k.a. "all CPUs", IPI request.  Per the Hyper-V spec, it is
legal to have a variable length of '0', e.g. VP_SET's BankContents in
this case, if the request can be serviced without the extra info.

  It is possible that for a given invocation of a hypercall that does
  accept variable sized input headers that all the header input fits
  entirely within the fixed size header. In such cases the variable sized
  input header is zero-sized and the corresponding bits in the hypercall
  input should be set to zero.

Bailing early results in KVM failing to send IPIs to all CPUs as expected
by the guest.
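As a sketch (approximate), the early-exit condition changes so that only a genuinely sparse request with an empty bank list bails out:

  /* before (approximately): if (!sparse_banks_len) goto ret_success; */
  if (!all_cpus && !sparse_banks_len)
          goto ret_success;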

Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 07:12:42 -05:00
Vitaly Kuznetsov
1ebfaa11eb KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall
Prior to commit 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use
tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH |
KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs
and KVM_REQ_TLB_FLUSH is/was defined as:

 (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that
"This call guarantees that by the time control returns back to the
caller, the observable effects of all flushes on the specified virtual
processors have occurred." and without KVM_REQUEST_WAIT there's a small
chance that the vCPU making the TLB flush will resume running before
all IPIs get delivered to other vCPUs and a stale mapping can get read
there.

Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST:
kvm_hv_flush_tlb() is the sole caller which uses it for
kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where
KVM_REQUEST_WAIT makes a difference.
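The shape of the fix, as a sketch (the arch request number here is illustrative):

  #define KVM_REQ_TLB_FLUSH_GUEST \
          KVM_ARCH_REQ_FLAGS(27, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)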

Cc: stable@kernel.org
Fixes: 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211209102937.584397-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10 07:12:41 -05:00
David Woodhouse
6f2cdbdba4 KVM: Add Makefile.kvm for common files, use it for x86
Splitting kvm_main.c out into smaller and better-organized files is
slightly non-trivial when it involves editing a bunch of per-arch
KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211121125451.9489-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 12:56:02 -05:00
David Woodhouse
dc70ec217c KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING
I'd like to make the build include dirty_ring.c based on whether the
arch wants it or not. That's a whole lot simpler if there's a config
symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET
being set to something non-zero.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211121125451.9489-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 12:55:52 -05:00
Sean Christopherson
45af1bb99b KVM: VMX: Clean up PI pre/post-block WARNs
Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer.  This also eliminates a
few extra uops from the retry path.

Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have failed before, but any such failure
would be inherently racy as some other agent (CPU or device) would have
to concurrently modify the PI descriptor.

Add a helper to handle the actual write and more importantly to document
why the write may need to be retried.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 09:30:47 -05:00
Sean Christopherson
83c98007d9 KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails
Add a memory barrier between writing vcpu->requests and reading
vcpu->guest_mode to ensure the read is ordered after the write when
(potentially) delivering an IRQ to L2 via nested posted interrupt.  If
the request were to be completed after reading vcpu->mode, it would be
possible for the target vCPU to enter the guest without posting the
interrupt and without handling the event request.

Note, the barrier is only for documentation since atomic operations are
serializing on x86.
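In simplified form, the ordering in the nested posted-interrupt delivery path looks like this sketch:

  kvm_make_request(KVM_REQ_EVENT, vcpu);          /* write vcpu->requests */

  /*
   * No-op on x86 (atomics are serializing), but documents that the request
   * must be visible before vcpu->mode is read when trying to post the IRQ.
   */
  smp_mb__after_atomic();

  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
          kvm_vcpu_wake_up(vcpu);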

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 6b6977117f50 ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2")
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 09:30:46 -05:00
Maxim Levitsky
8e819d75cb KVM: x86: add a tracepoint for APICv/AVIC interrupt delivery
This makes it possible to see how many interrupts were delivered via
APICv/AVIC from the host.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211209115440.394441-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 09:07:39 -05:00
Peter Zijlstra
cabdc3a847 sched,x86: Don't use cluster topology for x86 hybrid CPUs
For x86 hybrid CPUs like Alder Lake, the order of CPU selection should
be based strictly on CPU priority.  Don't include cluster topology for
hybrid CPUs to avoid interference with such CPU selection order.

On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs
Big core cluster (2 hyperthread CPUs). This could potentially bias CPU
selection towards Atom over Big Core, when Big core CPU has higher
priority.

Fixes: 66558b730f25 ("sched: Add cluster scheduler level for x86")
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20211204091402.GM16608@worktop.programming.kicks-ass.net
2021-12-08 22:15:37 +01:00
Vitaly Kuznetsov
502d2bf5f2 KVM: nVMX: Implement Enlightened MSR Bitmap feature
Updating the MSR bitmap for L2 is not cheap and rarely needed. The TLFS for
Hyper-V offers an 'Enlightened MSR Bitmap' feature which allows the L1
hypervisor to inform L0 when it changes the MSR bitmap; this eliminates the
need to examine L1's MSR bitmap for L2 every time the 'real' MSR bitmap for
L2 gets constructed.

Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.

Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.

When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:05 -05:00
Vitaly Kuznetsov
ed2a4800ae KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt
Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.

Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is not
strictly needed for Enlightened MSR Bitmap, as the feature can only be used
in conjunction with Enlightened VMCS. Keep the tracking complete anyway:
it's cheap, and in the future a similar PV feature can easily be
implemented for KVM on KVM too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:04 -05:00
Vitaly Kuznetsov
b84155c380 KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helper
In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V
guests move MSR bitmap update tracking to a dedicated helper.

Note: vmx_msr_bitmap_l01_changed() is called whenever the MSR bitmap might
be updated. KVM doesn't check whether the bit it's trying to set is already
set (or the bit it's trying to clear is already cleared). Such situations
should not be common, and a few false positives are not a problem.
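
A minimal sketch of the helper (the msr_bitmap_changed assignment belongs
to the companion patch above and is shown only for context):

static void vmx_msr_bitmap_l01_changed(struct vcpu_vmx *vmx)
{
	/*
	 * When KVM itself runs on top of Hyper-V with eVMCS, reset the
	 * MSR-bitmap clean field so that L0 Hyper-V notices the update.
	 */
	if (static_branch_unlikely(&enable_evmcs))
		evmcs_touch_msr_bitmap();

	/* Companion patch: the MSR bitmap for L2 must be rebuilt. */
	vmx->nested.msr_bitmap_changed = true;
}

vmx_disable_intercept_for_msr() and vmx_enable_intercept_for_msr() then
call this helper instead of touching the eVMCS clean fields directly.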

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211129094704.326635-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:04 -05:00
Paolo Bonzini
93b350f884 Merge branch 'kvm-on-hv-msrbm-fix' into HEAD
Merge bugfix for enlightened MSR Bitmap, before adding support
to KVM for exposing the feature to nested guests.
2021-12-08 05:30:48 -05:00
Hou Wenlong
adbfb12d4c KVM: x86: Exit to userspace if emulation prepared a completion callback
em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if an MSR access
requires an exit to userspace. However, x86_emulate_insn() doesn't return
X86EMUL_*, so x86_emulate_instruction() doesn't act directly on
X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate
between PIO, MMIO, etc. As a result, RDMSR/WRMSR emulation currently does
not exit to userspace at all.

Nevertheless, if the userspace_msr_exit_test testcase in selftests
is changed to test RDMSR/WRMSR with a forced emulation prefix,
the test passes.  What happens is that the userspace exit information is
filled in first, but the exit to userspace does not actually happen.
Because x86_emulate_instruction() returns 1, the guest retries
the instruction---but this time RIP has already been adjusted
past the forced emulation prefix, so the guest executes RDMSR/WRMSR
and the userspace exit finally happens.

Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io
callback, x86_emulate_instruction() can just return 0 if the
callback is not NULL. Then RDMSR/WRMSR instruction emulation will
exit to userspace directly, without the RDMSR/WRMSR vmexit.
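
For illustration, the change in x86_emulate_instruction() amounts to
something like the following in the exit-to-userspace path (a sketch, not
the exact diff):

	/* ... existing PIO/MMIO completion handling ... */
	} else if (vcpu->arch.complete_userspace_io) {
		/*
		 * The X86EMUL_IO_NEEDED path already armed a completion
		 * callback, so returning 0 is enough to make the RDMSR/WRMSR
		 * exit actually reach userspace; the callback resumes the
		 * vCPU afterwards.
		 */
		writeback = false;
		r = 0;
	}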

Fixes: 1ae099540e8c7 ("KVM: x86: Allow deflecting unknown MSR accesses to user space")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08 05:24:11 -05:00
Vitaly Kuznetsov
250552b925 KVM: nVMX: Don't use Enlightened MSR Bitmap for L3
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
The MSR bitmap for L2 is built in nested_vmx_prepare_msr_bitmap() by blending
MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't
check if the resulting bitmap is different and never cleans
HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
may result in Hyper-V missing the update.

The issue could've been solved by calling evmcs_touch_msr_bitmap() for
eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so
would not give any performance benefits (compared to not using Enlightened
MSR Bitmap at all). 3-level nesting is also not a very common setup
nowadays.

Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
now.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:53:13 -05:00
Hou Wenlong
d2f7d49826 KVM: x86: Use different callback if msr access comes from the emulator
If an MSR access triggers an exit to userspace, the complete_userspace_io
callback skips the instruction via the vendor callback used by
kvm_skip_emulated_instruction(). However, when the MSR access comes from
the emulator, e.g. if kvm.force_emulation_prefix is enabled and the guest
uses rdmsr/wrmsr with the KVM prefix, VM_EXIT_INSTRUCTION_LEN in the VMCS
is invalid and kvm_emulate_instruction() should be used to skip the
instruction instead.

As Sean noted, unlike the previous case, there's no #UD if
unrestricted guest is disabled and the guest accesses an MSR in
Big RM. So the correct way to fix this is to attach a different
callback when the msr access comes from the emulator.
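
A sketch of the resulting pair of completion callbacks (the helper names
are illustrative):

/* MSR access came from a real RDMSR/WRMSR vmexit: instruction length is valid. */
static int complete_fast_msr_access(struct kvm_vcpu *vcpu)
{
	return kvm_skip_emulated_instruction(vcpu);
}

/* MSR access came from the emulator: re-enter the emulator to skip it. */
static int complete_emulated_msr_access(struct kvm_vcpu *vcpu)
{
	/* Combined with the completion flags added by the companion patches. */
	return kvm_emulate_instruction(vcpu, EMULTYPE_SKIP);
}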

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08 04:25:16 -05:00
Hou Wenlong
906fa90416 KVM: x86: Add an emulation type to handle completion of user exits
The next patch will use kvm_emulate_instruction() with EMULTYPE_SKIP in
the complete_userspace_io callback to fix a problem in MSR access
emulation. However, EMULTYPE_SKIP only updates RIP; other work, such as
updating the interruptibility state and injecting single-step #DBs, would
still have to be done in the callback. Since the emulator already does
those things after x86_emulate_insn(), add a new emulation type to pair
with EMULTYPE_SKIP so that completion of user exits is handled within the
emulator.
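
For illustration, a sketch of how such a flag (called
EMULTYPE_COMPLETE_USER_EXIT here purely as a placeholder name) would be
consumed in the EMULTYPE_SKIP path of x86_emulate_instruction():

	if (emulation_type & EMULTYPE_SKIP) {
		/* ... decode and advance ctxt->_eip as before ... */

		if (emulation_type & EMULTYPE_COMPLETE_USER_EXIT) {
			/*
			 * Fall through to the emulator's normal tail so the
			 * interruptibility state is updated and any pending
			 * single-step #DB is injected, just as after
			 * x86_emulate_insn().
			 */
			r = 1;
			goto writeback;
		}

		kvm_rip_write(vcpu, ctxt->eip);
		return 1;
	}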

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08 04:25:15 -05:00
Sean Christopherson
5e854864ee KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg
Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the
decode phase does not truncate _eip.  Wrapping the 32-bit boundary is
legal if and only if CS is a flat code segment, but that check is
implicitly handled in the form of limit checks in the decode phase.

Opportunistically prepare for a future fix by storing the result of any
truncation in "eip" instead of "_eip".

Fixes: 1957aa63be53 ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08 04:25:15 -05:00
Li RongQing
51b1209c61 KVM: Clear pv eoi pending bit only when it is set
Merge pv_eoi_get_pending() and pv_eoi_clr_pending() into a single function,
pv_eoi_test_and_clear_pending(), which returns and clears the value of the
pending bit.

This makes it possible to clear the pending bit only if the guest
had set it, and otherwise skip the call to pv_eoi_put_user().
This can save up to 300 nsec on AMD EPYC processors.
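
A sketch of the merged helper, assuming the existing
pv_eoi_get_user()/pv_eoi_put_user() accessors (details may differ from the
final version):

static bool pv_eoi_test_and_clear_pending(struct kvm_vcpu *vcpu)
{
	u8 val;

	if (pv_eoi_get_user(vcpu, &val) < 0)
		return false;

	val &= KVM_PV_EOI_ENABLED;

	/* Only write back (and pay for the user access) if the bit was set. */
	if (val && pv_eoi_put_user(vcpu, KVM_PV_EOI_DISABLED) < 0)
		return false;

	return val;
}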

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:14 -05:00