1065 Commits

Author SHA1 Message Date
Sean Christopherson
2b01281273 KVM: Register /dev/kvm as the _very_ last thing during initialization
Register /dev/kvm, i.e. expose KVM to userspace, only after all other
setup has completed.  Once /dev/kvm is exposed, userspace can start
invoking KVM ioctls, creating VMs, etc...  If userspace creates a VM
before KVM is done with its configuration, bad things may happen, e.g.
KVM will fail to properly migrate vCPU state if a VM is created before
KVM has registered preemption notifiers.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:40:42 -05:00
Linus Torvalds
8fa590bf34 ARM64:
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 * Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
   which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d:
   "Fix a number of issues with MTE, such as races on the tags being
   initialised vs the PG_mte_tagged flag as well as the lack of support
   for VM_SHARED when KVM is involved.  Patches from Catalin Marinas and
   Peter Collingbourne").
 
 * Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 * Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 * Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 * Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 s390:
 
 * Second batch of the lazy destroy patches
 
 * First batch of KVM changes for kernel virtual != physical address support
 
 * Removal of a unused function
 
 x86:
 
 * Allow compiling out SMM support
 
 * Cleanup and documentation of SMM state save area format
 
 * Preserve interrupt shadow in SMM state save area
 
 * Respond to generic signals during slow page faults
 
 * Fixes and optimizations for the non-executable huge page errata fix.
 
 * Reprogram all performance counters on PMU filter change
 
 * Cleanups to Hyper-V emulation and tests
 
 * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
   running on top of a L1 Hyper-V hypervisor)
 
 * Advertise several new Intel features
 
 * x86 Xen-for-KVM:
 
 ** Allow the Xen runstate information to cross a page boundary
 
 ** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
 
 ** Add support for 32-bit guests in SCHEDOP_poll
 
 * Notable x86 fixes and cleanups:
 
 ** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
 
 ** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
    years back when eliminating unnecessary barriers when switching between
    vmcs01 and vmcs02.
 
 ** Clean up vmread_error_trampoline() to make it more obvious that params
    must be passed on the stack, even for x86-64.
 
 ** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
    of the current guest CPUID.
 
 ** Fudge around a race with TSC refinement that results in KVM incorrectly
    thinking a guest needs TSC scaling when running on a CPU with a
    constant TSC, but no hardware-enumerated TSC frequency.
 
 ** Advertise (on AMD) that the SMM_CTL MSR is not supported
 
 ** Remove unnecessary exports
 
 Generic:
 
 * Support for responding to signals during page faults; introduces
   new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
 
 Selftests:
 
 * Fix an inverted check in the access tracking perf test, and restore
   support for asserting that there aren't too many idle pages when
   running on bare metal.
 
 * Fix build errors that occur in certain setups (unsure exactly what is
   unique about the problematic setup) due to glibc overriding
   static_assert() to a variant that requires a custom message.
 
 * Introduce actual atomics for clear/set_bit() in selftests
 
 * Add support for pinning vCPUs in dirty_log_perf_test.
 
 * Rename the so called "perf_util" framework to "memstress".
 
 * Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the memstress tests.
 
 * Add a common ucall implementation; code dedup and pre-work for running
   SEV (and beyond) guests in selftests.
 
 * Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).
 
 * A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
   breakpoints, stage-2 faults and access tracking.
 
 * x86-specific selftest changes:
 
 ** Clean up x86's page table management.
 
 ** Clean up and enhance the "smaller maxphyaddr" test, and add a related
    test to cover generic emulation failure.
 
 ** Clean up the nEPT support checks.
 
 ** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
 
 ** Fix an ordering issue in the AMX test introduced by recent conversions
    to use kvm_cpu_has(), and harden the code to guard against similar bugs
    in the future.  Anything that tiggers caching of KVM's supported CPUID,
    kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
    the caching occurs before the test opts in via prctl().
 
 Documentation:
 
 * Remove deleted ioctls from documentation
 
 * Clean up the docs for the x86 MSR filter.
 
 * Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
 KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
 mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
 yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
 Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
 MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
 =QAYX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87a97d: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     no-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of a unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
     guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight psuedo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that tiggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
2022-12-15 11:12:21 -08:00
Paolo Bonzini
9352e7470a Merge remote-tracking branch 'kvm/queue' into HEAD
x86 Xen-for-KVM:

* Allow the Xen runstate information to cross a page boundary

* Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

* add support for 32-bit guests in SCHEDOP_poll

x86 fixes:

* One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

* Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
   years back when eliminating unnecessary barriers when switching between
   vmcs01 and vmcs02.

* Clean up the MSR filter docs.

* Clean up vmread_error_trampoline() to make it more obvious that params
  must be passed on the stack, even for x86-64.

* Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
  of the current guest CPUID.

* Fudge around a race with TSC refinement that results in KVM incorrectly
  thinking a guest needs TSC scaling when running on a CPU with a
  constant TSC, but no hardware-enumerated TSC frequency.

* Advertise (on AMD) that the SMM_CTL MSR is not supported

* Remove unnecessary exports

Selftests:

* Fix an inverted check in the access tracking perf test, and restore
  support for asserting that there aren't too many idle pages when
  running on bare metal.

* Fix an ordering issue in the AMX test introduced by recent conversions
  to use kvm_cpu_has(), and harden the code to guard against similar bugs
  in the future.  Anything that tiggers caching of KVM's supported CPUID,
  kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
  the caching occurs before the test opts in via prctl().

* Fix build errors that occur in certain setups (unsure exactly what is
  unique about the problematic setup) due to glibc overriding
  static_assert() to a variant that requires a custom message.

* Introduce actual atomics for clear/set_bit() in selftests

Documentation:

* Remove deleted ioctls from documentation

* Various fixes
2022-12-12 15:54:07 -05:00
Paolo Bonzini
eb5618911a KVM/arm64 updates for 6.2
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 - Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
   option, which multi-process VMMs such as crosvm rely on.
 
 - Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 - Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 - Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
   stage-2 faults and access tracking. You name it, we got it, we
   probably broke it.
 
 - Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 As a side effect, this tag also drags:
 
 - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
   series
 
 - A shared branch with the arm64 tree that repaints all the system
   registers to match the ARM ARM's naming, and resulting in
   interesting conflicts
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmOODb0PHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDztsQAInRnsgLl57/SpqhZzExNCllN6AT/bdeB3uz
 rnw3ScJOV174uNKp8lnPWoTvu2YUGiVtBp6tFHhDI8le7zHX438ZT8KE5mcs8p5i
 KfFKnb8SHV2DDpqkcy24c0Xl/6vsg1qkKrdfJb49yl5ZakRITDpynW/7tn6dXsxX
 wASeGFdCYeW4g2xMQzsCbtx6LgeQ8uomBmzRfPrOtZHYYxAn6+4Mj4595EC1sWxM
 AQnbp8tW3Vw46saEZAQvUEOGOW9q0Nls7G21YqQ52IA+ZVDK1LmAF2b1XY3edjkk
 pX8EsXOURfqdasBxfSfF3SgnUazoz9GHpSzp1cTVTktrPp40rrT7Ldtml0ktq69d
 1malPj47KVMDsIq0kNJGnMxciXFgAHw+VaCQX+k4zhIatNwviMbSop2fEoxj22jc
 4YGgGOxaGrnvmAJhreCIbr4CkZk5CJ8Zvmtfg+QM6npIp8BY8896nvORx/d4i6tT
 H4caadd8AAR56ANUyd3+KqF3x0WrkaU0PLHJLy1tKwOXJUUTjcpvIfahBAAeUlSR
 qEFrtb+EEMPgAwLfNOICcNkPZR/yyuYvM+FiUQNVy5cNiwFkpztpIctfOFaHySGF
 K07O2/a1F6xKL0OKRUg7hGKknF9ecmux4vHhiUMuIk9VOgNTWobHozBDorLKXMzC
 aWa6oGVC
 =iIPT
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.2

- Enable the per-vcpu dirty-ring tracking mechanism, together with an
  option to keep the good old dirty log around for pages that are
  dirtied by something other than a vcpu.

- Switch to the relaxed parallel fault handling, using RCU to delay
  page table reclaim and giving better performance under load.

- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
  option, which multi-process VMMs such as crosvm rely on.

- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
  to have its own view of a vcpu, keeping that state private.

- Add support for the PMUv3p5 architecture revision, bringing support
  for 64bit counters on systems that support it, and fix the
  no-quite-compliant CHAIN-ed counter support for the machines that
  actually exist out there.

- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
  only) as a prefix of the oncoming support for 4kB and 16kB pages.

- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
  stage-2 faults and access tracking. You name it, we got it, we
  probably broke it.

- Pick a small set of documentation and spelling fixes, because no
  good merge window would be complete without those.

As a side effect, this tag also drags:

- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
  series

- A shared branch with the arm64 tree that repaints all the system
  registers to match the ARM ARM's naming, and resulting in
  interesting conflicts
2022-12-09 09:12:12 +01:00
Marc Zyngier
a937f37d85 Merge branch kvm-arm64/dirty-ring into kvmarm-master/next
* kvm-arm64/dirty-ring:
  : .
  : Add support for the "per-vcpu dirty-ring tracking with a bitmap
  : and sprinkles on top", courtesy of Gavin Shan.
  :
  : This branch drags the kvmarm-fixes-6.1-3 tag which was already
  : merged in 6.1-rc4 so that the branch is in a working state.
  : .
  KVM: Push dirty information unconditionally to backup bitmap
  KVM: selftests: Automate choosing dirty ring size in dirty_log_test
  KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
  KVM: selftests: Use host page size to map ring buffer in dirty_log_test
  KVM: arm64: Enable ring-based dirty memory tracking
  KVM: Support dirty ring in conjunction with bitmap
  KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
  KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-12-05 14:19:50 +00:00
Sean Christopherson
dd03cc90e0 KVM: Remove stale comment about KVM_REQ_UNHALT
Remove a comment about KVM_REQ_UNHALT being set by kvm_vcpu_check_block()
that was missed when KVM_REQ_UNHALT was dropped.

Fixes: c59fb1275838 ("KVM: remove KVM_REQ_UNHALT")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221201220433.31366-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-02 13:00:28 -05:00
Paolo Bonzini
6c7b2202e4 KVM: x86: avoid memslot check in NX hugepage recovery if it cannot succeed
Since gfn_to_memslot() is relatively expensive, it helps to
skip it if it the memslot cannot possibly have dirty logging
enabled.  In order to do this, add to struct kvm a counter
of the number of log-page memslots.  While the correct value
can only be read with slots_lock taken, the NX recovery thread
is content with using an approximate value.  Therefore, the
counter is an atomic_t.

Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/
by David Matlack.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 11:30:12 -05:00
David Matlack
9eb8ca049c KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt,
rather than just sampling the module parameter when the VM is first
created. This restore the original behavior of kvm.halt_poll_ns for VMs
that have not opted into KVM_CAP_HALT_POLL.

Notably, this change restores the ability for admins to disable or
change the maximum halt-polling time system wide for VMs not using
KVM_CAP_HALT_POLL.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: acd05785e48c ("kvm: add capability for halt polling")
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20221117001657.1067231-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 10:50:07 -05:00
David Matlack
175d5dc79d KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling
Avoid re-reading kvm->max_halt_poll_ns multiple times during
halt-polling except when it is explicitly useful, e.g. to check if the
max time changed across a halt. kvm->max_halt_poll_ns can be changed at
any time by userspace via KVM_CAP_HALT_POLL.

This bug is unlikely to cause any serious side-effects. In the worst
case one halt polls for shorter or longer than it should, and then is
fixed up on the next halt. Furthmore, this is still possible since
kvm->max_halt_poll_ns are not synchronized with halts.

Fixes: acd05785e48c ("kvm: add capability for halt polling")
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20221117001657.1067231-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 10:49:14 -05:00
David Matlack
97b6847ac1 KVM: Cap vcpu->halt_poll_ns before halting rather than after
Cap vcpu->halt_poll_ns based on the max halt polling time just before
halting, rather than after the last halt. This arguably provides better
accuracy if an admin disables halt polling in between halts, although
the improvement is nominal.

A side-effect of this change is that grow_halt_poll_ns() no longer needs
to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future
commit where the max halt polling time can come from the module parameter
halt_poll_ns instead.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20221117001657.1067231-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 10:49:06 -05:00
Gavin Shan
c57351a75d KVM: Push dirty information unconditionally to backup bitmap
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists
and a running vcpu context is strictly required by architecture. It may
cause backwards compatible issue. Currently, saving vgic/its tables is
the only known case where no running vcpu context is expected. We may
have other unknown cases where no running vcpu context exists and it's
reported by the warning message and we bail out without pushing the
dirty information to the backup bitmap. For this, the application is
going to enable the backup bitmap for the unknown cases. However, the
dirty information can't be pushed to the backup bitmap even though the
backup bitmap is enabled for those unknown cases in the application,
until the unknown cases are added to the allowed list of non-running
vcpu context with extra code changes to the host kernel.

In order to make the new application, where the backup bitmap has been
enabled, to work with the unchanged host, we continue to push the dirty
information to the backup bitmap instead of bailing out early. With the
added check on 'memslot->dirty_bitmap' to mark_page_dirty_in_slot(), the
kernel crash is avoided silently by the combined conditions: no running
vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true',
and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled
yet.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
2022-11-12 11:20:59 +00:00
Gavin Shan
86bdf3ebcf KVM: Support dirty ring in conjunction with bitmap
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
enabled. It's conflicting with that ring-based dirty page tracking always
requires a running VCPU context.

Introduce a new flavor of dirty ring that requires the use of both VCPU
dirty rings and a dirty bitmap. The expectation is that for non-VCPU
sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before migrating
the VM to the target.

Use an additional capability to advertise this behavior. The newly added
capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.

Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
2022-11-10 13:11:58 +00:00
Gavin Shan
cf87ac739e KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL
The VCPU isn't expected to be runnable when the dirty ring becomes soft
full, until the dirty pages are harvested and the dirty ring is reset
from userspace. So there is a check in each guest's entrace to see if
the dirty ring is soft full or not. The VCPU is stopped from running if
its dirty ring has been soft full. The similar check will be needed when
the feature is going to be supported on ARM64. As Marc Zyngier suggested,
a new event will avoid pointless overhead to check the size of the dirty
ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance.

Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
becomes soft full in kvm_dirty_ring_push(). The event is only cleared in
the check, done in the newly added helper kvm_dirty_ring_check_request().
Since the VCPU is not runnable when the dirty ring becomes soft full, the
KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from
running until the dirty pages are harvested and the dirty ring is reset by
userspace.

kvm_dirty_ring_soft_full() becomes a private function with the newly added
helper kvm_dirty_ring_check_request(). The alignment for the various event
definitions in kvm_host.h is changed to tab character by the way. In order
to avoid using 'container_of()', the argument @ring is replaced by @vcpu
in kvm_dirty_ring_push().

Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
2022-11-10 13:11:57 +00:00
Peter Xu
c8b88b332b kvm: Add interruptible flag to __gfn_to_pfn_memslot()
Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request.  Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.

This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even during e.g. handling an userfaultfd page fault.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:27 -05:00
Peter Xu
fe5ed56c79 kvm: Add KVM_PFN_ERR_SIGPENDING
Add a new pfn error to show that we've got a pending signal to handle
during hva_to_pfn_slow() procedure (of -EINTR retval).

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:27 -05:00
Paolo Bonzini
f4298cac2b Merge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
* Fix the pKVM stage-1 walker erronously using the stage-2 accessor

* Correctly convert vcpu->kvm to a hyp pointer when generating
  an exception in a nVHE+MTE configuration

* Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them

* Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

* Document the boot requirements for FGT when entering the kernel
  at EL1
2022-11-06 03:30:49 -05:00
Gavin Shan
7a2726ec32 KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them
There are two capabilities related to ring-based dirty page tracking:
KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are
supported by x86. However, arm64 supports KVM_CAP_DIRTY_LOG_RING_ACQ_REL
only when the feature is supported on arm64. The userspace doesn't have
to enable the advertised capability, meaning KVM_CAP_DIRTY_LOG_RING can
be enabled on arm64 by userspace and it's wrong.

Fix it by double checking if the capability has been advertised prior to
enabling it. It's rejected to enable the capability if it hasn't been
advertised.

Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
2022-10-31 17:22:15 +00:00
Hou Wenlong
180418e2eb KVM: debugfs: Return retval of simple_attr_open() if it fails
Although simple_attr_open() fails only with -ENOMEM with current code
base, it would be nicer to return retval of simple_attr_open() directly
in kvm_debugfs_open().

No functional change intended.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27 04:41:55 -04:00
Alexander Graf
ed51862f2f kvm: Add support for arch compat vm ioctls
We will introduce the first architecture specific compat vm ioctl in the
next patch. Add all necessary boilerplate to allow architectures to
override compat vm ioctls when necessary.

Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22 05:15:23 -04:00
Paolo Bonzini
fe4d9e4abf KVM/arm64 updates for v6.1
- Fixes for single-stepping in the presence of an async
   exception as well as the preservation of PSTATE.SS
 
 - Better handling of AArch32 ID registers on AArch64-only
   systems
 
 - Fixes for the dirty-ring API, allowing it to work on
   architectures with relaxed memory ordering
 
 - Advertise the new kvmarm mailing list
 
 - Various minor cleanups and spelling fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmM5hQcPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDoMUP/jra4HSmujLUB5G7Op8HxuurEecOc6xtw0Af
 AbDLlVc2Vs4rrdVh8GMc8D80atUAVitp8IFjdp/PzI2GTBTzWz43Gav2AbhgIJbJ
 xoFVHL8LkdHKyMbq10359DqGMqhIf41OFzGwhbzcx2V4pKNkSpjbCpu3bi/+Ybjg
 006ZpZc7NAU0rZgw9Flb/dhn0jw7RMc3orhoDQ4tBp1P/VhvqvgFt5bWipkvvBP7
 +lQK28ujG3ghST/hKRhg6ozgy5+6NEEHMuhErMYP8nIivRchX+pWF2Lb0qGH1e+U
 v2MZIZnIIUjyTV1vbYlxtltzfYmPuQ2MFNUBawI9tmlIOU9vJSCzeJS64uWK4KLV
 kbmk57OfC7rQoSNJH4jaKQp0YpIktrB9Vei97t4I7NwEmkjQj6cLTgg4tQrNqTiQ
 cFGeC9mE+lhFC8z1lCbna2eG631FxpPrB1SJ1/CU9wboam9dUfXGIvBPh+i2pvMZ
 vcxzUZJ11y+/uhp4k8i2PBwNno0iwRXd5MinwRUs2CR5vhs8qa5y7FVWKyqKpgI2
 xqr4lYTixJZL3mWkYyOQuClrTbT1zkoaPldLq6M7wvO08+QV8ryMeyKT+9s/gNQU
 dcYSwBCWZaOZm2nN8/zjxRb7VqZVu3cwyXi9XXUWNTCgIe/Q/SDPbXU/Hwbgzf8X
 UsQF7e9A
 =aNPK
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for v6.1

- Fixes for single-stepping in the presence of an async
  exception as well as the preservation of PSTATE.SS

- Better handling of AArch32 ID registers on AArch64-only
  systems

- Fixes for the dirty-ring API, allowing it to work on
  architectures with relaxed memory ordering

- Advertise the new kvmarm mailing list

- Various minor cleanups and spelling fixes
2022-10-03 15:33:32 -04:00
Marc Zyngier
17601bfed9 KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option
In order to differenciate between architectures that require no extra
synchronisation when accessing the dirty ring and those who do,
add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identify
the latter sort. TSO architectures can obviously advertise both, while
relaxed architectures must only advertise the ACQ_REL version.

This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING
being only indirectly selected by two top-level config symbols:
- HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86)
- HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64)

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
2022-09-29 10:23:08 +01:00
Paolo Bonzini
c59fb12758 KVM: remove KVM_REQ_UNHALT
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return
value of kvm_vcpu_block/kvm_vcpu_halt.  Remove it.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20220921003201.1441511-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:21 -04:00
Miaohe Lin
5a2a961be2 KVM: fix memoryleak in kvm_init()
When alloc_cpumask_var_node() fails for a certain cpu, there might be some
allocated cpumasks for percpu cpu_kick_mask. We should free these cpumasks
or memoryleak will occur.

Fixes: baff59ccdc65 ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:27 -04:00
Li kunyu
eceb6e1d53 KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()
The variable is initialized but it is only used after its assignment.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:43 -04:00
Li kunyu
2824913976 KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow()
The variable is initialized but it is only used after its assignment.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:43 -04:00
Chao Peng
20ec3ebd70 KVM: Rename mmu_notifier_* to mmu_invalidate_*
The motivation of this renaming is to make these variables and related
helper functions less mmu_notifier bound and can also be used for non
mmu_notifier based page invalidation. mmu_invalidate_* was chosen to
better describe the purpose of 'invalidating' a page that those
variables are used for.

  - mmu_notifier_seq/range_start/range_end are renamed to
    mmu_invalidate_seq/range_start/range_end.

  - mmu_notifier_retry{_hva} helper functions are renamed to
    mmu_invalidate_retry{_hva}.

  - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
    avoid confusion with mn_active_invalidate_count.

  - While here, also update kvm_inc/dec_notifier_count() to
    kvm_mmu_invalidate_begin/end() to match the change for
    mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:41 -04:00
Sean Christopherson
c2b8239701 KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()
Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
and initializing coalesced MMIO objects is separate from registering any
associated devices.  Moving coalesced MMIO cleans up the last oddity
where KVM does VM creation/initialization after kvm_create_vm(), and more
importantly after kvm_arch_post_init_vm() is called and the VM is added
to the global vm_list, i.e. after the VM is fully created as far as KVM
is concerned.

Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
the original implementation was completely devoid of error handling.
Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
error handling" fixed the various bugs, and in doing so rightly moved the
call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
registered the coalesced MMIO device.  Commit 2b3c246a682c ("KVM: Make
coalesced mmio use a device per zone") cleaned up that mess by having
each zone register a separate device, i.e. moved device registration to
its logical home in kvm_vm_ioctl_register_coalesced_mmio().  As a result,
kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
be safely called from kvm_create_vm().

Opportunstically drop the #ifdef, KVM provides stubs for
kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220816053937.2477106-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:02:31 -04:00
Sean Christopherson
405294f29f KVM: Unconditionally get a ref to /dev/kvm module when creating a VM
Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_get_module(), which will fail if the module is in
the process of being forcefully unloaded.  The error handling when
try_get_module() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list.  Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.

The obvious alternative would be to add proper unwinding, but the
justification for using try_get_module(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.

It's still possible for try_get_module() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a falsea sense
of security.  Using try_get_module() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_get_module() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.

Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope.  Forcefully unloading any module taints kernel (for obvious
reasons)  _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users".  In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.

Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220816053937.2477106-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:02:31 -04:00
Sean Christopherson
4ba4f41942 KVM: Properly unwind VM creation if creating debugfs fails
Properly unwind VM creation if kvm_create_vm_debugfs() fails.  A recent
change to invoke kvm_create_vm_debug() in kvm_create_vm() was led astray
by buggy try_get_module() handling adding by commit 5f6de5cbebee ("KVM:
Prevent module exit until all VMs are freed").  The debugfs error path
effectively inherits the bad error path of try_module_get(), e.g. KVM
leaves the to-be-free VM on vm_list even though KVM appears to do the
right thing by calling module_put() and falling through.

Opportunistically hoist kvm_create_vm_debugfs() above the call to
kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
for the moment).  x86 is the only non-nop implementation of the post-init
hook, and it doesn't allocate/initialize any objects that are reachable
via debugfs code (spawns a kthread worker for the NX huge page mitigation).

Leave the buggy try_get_module() alone for now, it will be fixed in a
separate commit.

Fixes: b74ed7a68ec1 ("KVM: Actually create debugfs in kvm_create_vm()")
Reported-by: syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20220816053937.2477106-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:02:30 -04:00
Oliver Upton
b74ed7a68e KVM: Actually create debugfs in kvm_create_vm()
Doing debugfs creation after vm creation leaves things in a
quasi-initialized state for a while. This is further complicated by the
fact that we tear down debugfs from kvm_destroy_vm(). Align debugfs and
stats init/destroy with the vm init/destroy pattern to avoid any
headaches.

Note the fix for a benign mistake in error handling for calls to
kvm_arch_create_vm_debugfs() rolled in. Since all implementations of
the function return 0 unconditionally it isn't actually a bug at
the moment.

Lastly, tear down debugfs/stats data in the kvm_create_vm_debugfs()
error path. Previously it was safe to assume that kvm_destroy_vm() would
take out the garbage, that is no longer the case.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:28 -04:00
Oliver Upton
59f82aad59 KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()
At the time the VM fd is used in kvm_create_vm_debugfs(), the fd has
been allocated but not yet installed. It is only really useful as an
identifier in strings for the VM (such as debugfs).

Treat it exactly as such by passing the string name of the fd to
kvm_create_vm_debugfs(), futureproofing against possible misuse of the
VM fd.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-5-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:28 -04:00
Oliver Upton
20020f4cf2 KVM: Get an fd before creating the VM
Allocate a VM's fd at the very beginning of kvm_dev_ioctl_create_vm() so
that KVM can use the fd value to generate strigns, e.g. for debugfs,
when creating and initializing the VM.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:27 -04:00
Oliver Upton
58fc116645 KVM: Shove vcpu stats_id init into kvm_vcpu_init()
Initialize stats_id alongside other kvm_vcpu fields to make it more
difficult to unintentionally access stats_id before it's set.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-3-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:27 -04:00
Oliver Upton
f2759c08d8 KVM: Shove vm stats_id init into kvm_create_vm()
Initialize stats_id alongside other struct kvm fields to make it more
difficult to unintentionally access stats_id before it's set.  While at
it, move the format string to the first line of the call and fix the
indentation of the second line.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-2-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:27 -04:00
Paolo Bonzini
63f4b21041 Merge remote-tracking branch 'kvm/next' into kvm-next-5.20
KVM/s390, KVM/x86 and common infrastructure changes for 5.20

x86:

* Permit guests to ignore single-bit ECC errors

* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache

* Intel IPI virtualization

* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS

* PEBS virtualization

* Simplify PMU emulation by just using PERF_TYPE_RAW events

* More accurate event reinjection on SVM (avoid retrying instructions)

* Allow getting/setting the state of the speaker port data bit

* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent

* "Notify" VM exit (detect microarchitectural hangs) for Intel

* Cleanups for MCE MSR emulation

s390:

* add an interface to provide a hypervisor dump for secure guests

* improve selftests to use TAP interface

* enable interpretive execution of zPCI instructions (for PCI passthrough)

* First part of deferred teardown

* CPU Topology

* PV attestation

* Minor fixes

Generic:

* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple

x86:

* Use try_cmpxchg64 instead of cmpxchg64

* Bugfixes

* Ignore benign host accesses to PMU MSRs when PMU is disabled

* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior

* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis

* Port eager page splitting to shadow MMU as well

* Enable CMCI capability by default and handle injected UCNA errors

* Expose pid of vcpu threads in debugfs

* x2AVIC support for AMD

* cleanup PIO emulation

* Fixes for LLDT/LTR emulation

* Don't require refcounted "struct page" to create huge SPTEs

x86 cleanups:

* Use separate namespaces for guest PTEs and shadow PTEs bitmasks

* PIO emulation

* Reorganize rmap API, mostly around rmap destruction

* Do not workaround very old KVM bugs for L0 that runs with nesting enabled

* new selftests API for CPUID
2022-08-01 03:21:00 -04:00
Anup Patel
4ab0e470c0 KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
The kvm_mmu_topup_memory_cache() always uses GFP_KERNEL_ACCOUNT for
memory allocation which prevents it's use in atomic context. To address
this limitation of kvm_mmu_topup_memory_cache(), we add gfp_custom flag
in struct kvm_mmu_memory_cache. When the gfp_custom flag is set to some
GFP_xyz flags, the kvm_mmu_topup_memory_cache() will use that instead of
GFP_KERNEL_ACCOUNT.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
2022-07-29 17:15:00 +05:30
Vineeth Pillai
e36de87d34 KVM: debugfs: expose pid of vcpu threads
Add a new debugfs file to expose the pid of each vcpu threads. This
is very helpful for userland tools to get the vcpu pids without
worrying about thread naming conventions of the VMM.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Message-Id: <20220523190327.2658-1-vineeth@bitbyteword.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:33 -04:00
David Matlack
837f66c712 KVM: Allow for different capacities in kvm_mmu_memory_cache structs
Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare an cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the
objects pointer array to be dynamically allocated the first time the
cache is topped-up.

While here, opportunistically clean up the stack-allocated
kvm_mmu_memory_cache structs in riscv and arm64 to use designated
initializers.

No functional change intended.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:00 -04:00
Sean Christopherson
943dfea8f1 KVM: Do not zero initialize 'pfn' in hva_to_pfn()
Drop the unnecessary initialization of the local 'pfn' variable in
hva_to_pfn().  First and foremost, '0' is not an invalid pfn, it's a
perfectly valid pfn on most architectures.  I.e. if hva_to_pfn() were to
return an "uninitializd" pfn, it would actually be interpeted as a legal
pfn by most callers.

Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn()
explicitly sets pfn to an error value (or returns an error value directly)
if a helper returns failure, and all helpers set the pfn on success.

The zeroing of 'pfn' was introduced by commit 2fc843117d64 ("KVM:
reorganize hva_to_pfn"), probably to avoid "uninitialized variable"
warnings on statements that return pfn.  However, no compiler seems
to produce them, making the initialization unnecessary.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:36 -04:00
Sean Christopherson
b14b2690c5 KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
to better reflect what KVM is actually checking, and to eliminate extra
pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
helpers in particular benefit from "refouncted" nomenclature, as it's not
all that obvious why KVM needs to get/put refcounts for some PG_reserved
pages (ZERO_PAGE and ZONE_DEVICE).

Add a comment to call out that the list of exceptions to PG_reserved is
all but guaranteed to be incomplete.  The list has mostly been compiled
by people throwing noodles at KVM and finding out they stick a little too
well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
didn't get freed.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:35 -04:00
Sean Christopherson
284dc49307 KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page()
Operate on a 'struct page' instead of a pfn when checking if a page is a
ZONE_DEVICE page, and rename the helper accordingly.  Generally speaking,
KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
anything special for ZONE_DEVICE memory.  Rather, KVM wants to treat
ZONE_DEVICE memory like regular memory, and the need to identify
ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
other words, KVM should only ever check for ZONE_DEVICE memory after KVM
has already verified that there is a struct page associated with the pfn.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:34 -04:00
Sean Christopherson
b1624f99aa KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()
Drop helpers to convert a gfn/gpa to a 'struct page' in the context of a
vCPU.  KVM doesn't require that guests be backed by 'struct page' memory,
thus any use of helpers that assume 'struct page' is bound to be flawed,
as was the case for the recently removed last user in x86's nested VMX.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:34 -04:00
Sean Christopherson
6573a6910c KVM: Don't WARN if kvm_pfn_to_page() encounters a "reserved" pfn
Drop a WARN_ON() if kvm_pfn_to_page() encounters a "reserved" pfn, which
in this context means a struct page that has PG_reserved but is not a/the
ZERO_PAGE and is not a ZONE_DEVICE page.  The usage, via gfn_to_page(),
in x86 is safe as gfn_to_page() is used only to retrieve a page from
KVM-controlled memslot, but the usage in PPC and s390 operates on
arbitrary gfns and thus memslots that can be backed by incompatible
memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:33 -04:00
Sean Christopherson
8e1c69149f KVM: Avoid pfn_to_page() and vice versa when releasing pages
Invert the order of KVM's page/pfn release helpers so that the "inner"
helper operates on a page instead of a pfn.  As pointed out by Linus[*],
converting between struct page and a pfn isn't necessarily cheap, and
that's not even counting the overhead of is_error_noslot_pfn() and
kvm_is_reserved_pfn().  Even if the checks were dirt cheap, there's no
reason to convert from a page to a pfn and back to a page, just to mark
the page dirty/accessed or to put a reference to the page.

Opportunistically drop a stale declaration of kvm_set_page_accessed()
from kvm_host.h (there was no implementation).

No functional change intended.

[*] https://lore.kernel.org/all/CAHk-=wifQimj2d6npq-wCi5onYPjzQg4vyO4tFcPJJZr268cRw@mail.gmail.com

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:32 -04:00
Sean Christopherson
a1040b0d42 KVM: Don't set Accessed/Dirty bits for ZERO_PAGE
Don't set Accessed/Dirty bits for a struct page with PG_reserved set,
i.e. don't set A/D bits for the ZERO_PAGE.  The ZERO_PAGE (or pages
depending on the architecture) should obviously never be written, and
similarly there's no point in marking it accessed as the page will never
be swapped out or reclaimed.  The comment in page-flags.h is quite clear
that PG_reserved pages should be managed only by their owner, and
strictly following that mandate also simplifies KVM's logic.

Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:31 -04:00
Sean Christopherson
28b85ae06f KVM: Drop bogus "pfn != 0" guard from kvm_release_pfn()
Remove a check from kvm_release_pfn() to bail if the provided @pfn is
zero.  Zero is a perfectly valid pfn on most architectures, and should
not be used to indicate an error or an invalid pfn.  The bogus check was
added by commit 917248144db5 ("x86/kvm: Cache gfn to pfn translation"),
which also did the bad thing of zeroing the pfn and gfn to mark a cache
invalid.  Thankfully, that bad behavior was axed by commit 357a18ad230f
("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache").

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:31 -04:00
Lai Jiangshan
f24b44e48d KVM: Rename ack_flush() to ack_kick()
Make it use the same verb as in kvm_kick_many_cpus().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-15 08:07:53 -04:00
Paolo Bonzini
e15f5e6fa6 Merge branch 'kvm-5.20-early'
s390:

* add an interface to provide a hypervisor dump for secure guests

* improve selftests to show tests

x86:

* Intel IPI virtualization

* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS

* PEBS virtualization

* Simplify PMU emulation by just using PERF_TYPE_RAW events

* More accurate event reinjection on SVM (avoid retrying instructions)

* Allow getting/setting the state of the speaker port data bit

* Rewrite gfn-pfn cache refresh

* Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent

* "Notify" VM exit
2022-06-09 11:38:12 -04:00
Maxim Levitsky
18869f26df KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
On SVM, if preemption happens right after the call to finish_rcuwait
but before call to kvm_arch_vcpu_unblocking on SVM/AVIC, it itself
will re-enable AVIC, and then we will try to re-enable it again
in kvm_arch_vcpu_unblocking which will lead to a warning
in __avic_vcpu_load.

The same problem can happen if the vCPU is preempted right after the call
to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait
and in this case, we will end up with AVIC enabled during sleep -
Ooops.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:20 -04:00
Paolo Bonzini
66da65005a KVM/riscv fixes for 5.19, take #1
- Typo fix in arch/riscv/kvm/vmid.c
 
 - Remove broken reference pattern from MAINTAINERS entry
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKhgGkACgkQrUjsVaLH
 LAfCCw/+LEz6af/lm4PXr5CGkJ91xq2xMpU9o39jMBm0RltzsG7zt/90SaUdK/Oz
 MpX7CLFgb1Cm2ZZ/+l5cBlLc7NUaMMxHH9dpScyrYC8xAb75QYimpe/jfjuMyXjO
 IaYJB2WCs2gfTYXA58c4sB2WR5rLahLnQGJrwW2CfMSvpv/nAyEZyWYtgXw8tSxH
 oM06Z/cLWU53uWuX0hbKAVQMdAIrQK5H+z46bhbpFC6gk/XSvaBBEngoOiiE6lC6
 uM8i8ZIeUgqSeWWreczd6H25eYwyLuVxXHWSIgbdvEcvBUn0VDO+Ox4UA2ab3g3d
 uHubqdRY5GnrkbRK0ue6tXfON8NxGlKwlcc6kp9Vqxb3Jxjr2qwToTubHYAUVXUi
 XzrvSxoZRRikwstb1+PNXECCNYUHkNdj4FBA4WoF0Y3Br1IfSwZLUX+EKkY/DHv+
 L4MhFFNqsQPzVly2wNiyxuWwRQyxupHekeMQlp13P9vZnGcptxxEyuQlM1Hf40ST
 iiOC8L+TCQzc5dN156/KjQIUFPud4huJO+0xHQtang628yVzQazzcxD+ialPkcqt
 JnpMmNbvvNzFYLoB3dQ/36flmYRA6SbK4Tt4bdhls+UcnLnfHDZow7OLmX5yj8+A
 QiKx6IOS6KI10LXhVZguAmZuKjXajyLVaCWpBl0tV6XpV9Y5t98=
 =w6dT
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 5.19, take #1

- Typo fix in arch/riscv/kvm/vmid.c

- Remove broken reference pattern from MAINTAINERS entry
2022-06-09 09:45:00 -04:00