IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing. Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.
E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.
More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
No functional change intended.
Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result. No sane hardware will generate an access
that is both an instruct fetch and a write, i.e. it's a waste of cycles.
If hardware goes off the rails, or KVM runs under a misguided hypervisor,
spuriously running throught fast path is benign (KVM has been uknowingly
being doing exactly that for years).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault vian the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock
is held for write, as volatile SPTEs can be written by other tasks/vCPUs
outside of mmu_lock.
Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.
The fix does not seem to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers. Well, caller, since only one of its two callers
doesn't already do the shadow-present check.
Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change. Unlike the legacy MMU, the TDP MMU uses a single path for clear
leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(), and calling is_last_spte() without first
calling is_shadow_present_spte() is at best odd, and at worst a violation
of KVM's loosely defines SPTE rules.
Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present. Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
being volatile (unless they're volatile for other reasons, e.g. A/D bits).
KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
cleared just because the SPTE is MMU-writable.
Rename the wrapper of MMU-writable to be more literal, the previous name
of spte_can_locklessly_be_made_writable() is wrong and misleading.
Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
allocated with role.level = 5 and the guest pagetable's root gfn.
And root_sp->spt[0] is also allocated with the same gfn and the same
role except role.level = 4. Luckily that they are different shadow
pages, but only root_sp->spt[0] is the real translation of the guest
pagetable.
Here comes a problem:
If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
and uses the same gfn as the root page for nested NPT before and after
switching gCR4_LA57. The host (hCR4_LA57=1) might use the same root_sp
for the guest even the guest switches gCR4_LA57. The guest will see
unexpected page mapped and L2 may exploit the bug and hurt L1. It is
lucky that the problem can't hurt L0.
And three special cases need to be handled:
The root_sp should be like role.direct=1 sometimes: its contents are
not backed by gptes, root_sp->gfns is meaningless. (For a normal high
level sp in shadow paging, sp->gfns is often unused and kept zero, but
it could be relevant and meaningful if sp->gfns is used because they
are backed by concrete gptes.)
For such root_sp in the case, root_sp is just a portal to contribute
root_sp->spt[0], and root_sp->gfns should not be used and
root_sp->spt[0] should not be dropped if gpte[0] of the guest root
pagetable is changed.
Such root_sp should not be accounted too.
So add role.passthrough to distinguish the shadow pages in the hash
when gCR4_LA57 is toggled and fix above special cases by using it in
kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add sp_has_gptes() which equals to !sp->role.direct currently.
Shadow page having gptes needs to be write-protected, accounted and
responded to kvm_mmu_pte_write().
Use it in these places to replace !sp->role.direct and rename
for_each_gfn_indirect_valid_sp.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
direct_map is always equal to the direct field of the root page's role:
- for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
copied from cpu_role.base.direct
- for TDP, it is always true and root_role.direct is also always true
- for shadow TDP, it is always false and root_role.direct is also always
false
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove another duplicate field of struct kvm_mmu. This time it's
the root level for page table walking; the separate field is
always initialized as cpu_role.base.level, so its users can look
up the CPU mode directly instead.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
root_role.level is always the same value as shadow_level:
- it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
- it's the level argument when going through kvm_init_shadow_ept_mmu
- it's assigned directly from new_role.base.level when going
through shadow_mmu_init_context
Remove the duplication and get the level directly from the role.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not lead init_kvm_*mmu into the temptation of poking
into struct kvm_mmu_role_regs, by passing to it directly
the CPU mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Shadow MMUs compute their role from cpu_role.base, simply by adjusting
the root level. It's one line of code, so do not place it in a separate
function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before the separation of the CPU and the MMU role, CR0.PG was not
available in the base MMU role, because two-dimensional paging always
used direct=1 in the MMU role. However, now that the raw role is
snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
!cpu_role.base.direct and cpu_role.base.level > 0. There is no need to
store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
accessor by hand that takes care of the conversion. Use cpu_role.base.level
since the future of the direct field is unclear.
Likewise, CR4.PAE is now always present in the CPU role as
!cpu_role.base.has_4_byte_gpte. The inversion makes certain tests on
the MMU role easier, and is easily hidden by the is_cr4_pae accessor
when operating on the CPU role.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is quite confusing that the "full" union is called kvm_mmu_role
but is used for the "cpu_role" field of struct kvm_mmu. Rename it
to kvm_cpu_role.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_role represents the role of the root of the page tables.
It does not need any extended bits, as those govern only KVM's
page table walking; the is_* functions used for page table
walking always use the CPU role.
ext.valid is not present anymore in the MMU role, but an
all-zero MMU role is impossible because the level field is
never zero in the MMU role. So just zap the whole mmu_role
in order to force invalidation after CPUID is updated.
While making this change, which requires touching almost every
occurrence of "mmu_role", rename it to "root_role".
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that the MMU role is separate from the CPU role, it can be a
truthful description of the format of the shadow pages. This includes
whether the shadow pages use the NX bit; so force the efer_nx field
of the MMU role when TDP is disabled, and remove the hardcoding it in
the callers of reset_shadow_zero_bits_mask.
In fact, the initialization of reserved SPTE bits can now be made common
to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
by removing the computation of unnecessary bits.
Extended bits are unnecessary because page walking uses the CPU role,
and EFER.NX/CR0.WP can be set to one unconditionally---matching the
format of shadow pages rather than the format of guest pages.
The MMU role for two dimensional paging does still depend on the CPU role,
even if only barely so, due to SMM and guest mode; for consistency,
pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
the vcpu with is_smm or is_guest_mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_calc_shadow_root_page_role_common is the same as
kvm_calc_cpu_role except for the level, which is overwritten
afterwards in kvm_calc_shadow_mmu_root_page_role
and kvm_calc_shadow_npt_root_page_role.
role.base.direct is already set correctly for the CPU role,
and CR0.PG=1 is required for VMRUN so it will also be
correct for nested NPT.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ept_ad field is used during page walk to determine if the guest PTEs
have accessed and dirty bits. In the MMU role, the ad_disabled
bit represents whether the *shadow* PTEs have the bits, so it
would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
!mmu->mmu_role.base.ad_disabled.
However, the similar field in the CPU mode, ad_disabled, is initialized
correctly: to the opposite value of ept_ad for shadow EPT, and zero
for non-EPT guest paging modes (which always have A/D bits). It is
therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
like other page-format fields; it just has to be inverted to account
for the different polarity.
In fact, now that the CPU mode is distinct from the MMU roles, it would
even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because
the macro has a small effect in terms of dead code elimination:
text data bss dec hex
103544 16665 112 120321 1d601 # as of this patch
103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The root_level can be found in the cpu_role (in fact the field
is superfluous and could be removed, but one thing at a time).
Since there is only one usage left of role_regs_to_root_level,
inline it into kvm_calc_cpu_role.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Snapshot the state of the processor registers that govern page walk into
a new field of struct kvm_mmu. This is a more natural representation
than having it *mostly* in mmu_role but not exclusively; the delta
right now is represented in other fields, such as root_level.
The nested MMU now has only the CPU role; and in fact the new function
kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
except that it has role.base.direct equal to !CR0.PG. For a walk-only
MMU, "direct" has no meaning, but we set it to !CR0.PG so that
role.ext.cr0_pg can go away in a future patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The argument is always false now that kvm_mmu_calc_root_page_role has
been removed.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If accessed bits are not supported there simple isn't any distinction
between accessed and non-accessed gPTEs, so the comment does not make
much sense. Rephrase it in terms of what happens if accessed bits
*are* supported.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The init_kvm_*mmu functions, with the exception of shadow NPT,
do not need to know the full values of CR0/CR4/EFER; they only
need to know the bits that make up the "role". This cleanup
however will take quite a few incremental steps. As a start,
pull the common computation of the struct kvm_mmu_role_regs
into their caller: all of them extract the struct from the vcpu
as the very first step.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
struct kvm_mmu_role_regs is computed just once and then accessed. Use
const to make this clearer, even though the const fields of struct
kvm_mmu_role_regs already prevent (or make it harder...) to modify
the contents of the struct.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The role.base.smm flag is always zero when setting up shadow EPT,
do not bother copying it over from vcpu->arch.root_mmu.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear enable_mmio_caching if hardware can't support MMIO caching and use
the dedicated flag to detect if MMIO caching is enabled instead of
assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will
use a zero value even when caching is enabled, and is_mmio_spte() isn't
so hot that it needs to avoid an extra memory access, i.e. there's no
reason to be super clever. And the clever approach may not even be more
performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
uops for non-MMIO SPTEs.
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220420002747.3287931-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When determining whether or not a SPTE needs to have SME/SEV's memory
encryption flag set, do the moderately expensive host MMIO pfn check if
and only if the memory encryption mask is non-zero.
Note, KVM could further optimize the host MMIO checks by making a single
call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype
handling) will likely be split out to a separate flow[*]. At that point,
a better approach would be to shove the call to kvm_is_mmio_pfn() into
VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary
lookup.
[*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220415004909.2216670-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes for (relatively) old bugs, to be merged in both the -rc and next
development trees.
The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
5.18 and commit c24a950ec7d6 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
for SEV-ES", 2022-04-13).
KVM uses lookup_address_in_mm() to detect the hugepage size that the host
uses to map a pfn. The function suffers from several issues:
- no usage of READ_ONCE(*). This allows multiple dereference of the same
page table entry. The TOCTOU problem because of that may cause KVM to
incorrectly treat a newly generated leaf entry as a nonleaf one, and
dereference the content by using its pfn value.
- the information returned does not match what KVM needs; for non-present
entries it returns the level at which the walk was terminated, as long
as the entry is not 'none'. KVM needs level information of only 'present'
entries, otherwise it may regard a non-present PXE entry as a present
large page mapping.
- the function is not safe for mappings that can be torn down, because it
does not disable IRQs and because it returns a PTE pointer which is never
safe to dereference after the function returns.
So implement the logic for walking host page tables directly in KVM, and
stop using lookup_address_in_mm().
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220429031757.2042406-1-mizhang@google.com>
[Inline in host_pfn_mapping_level, ensure no semantic change for its
callers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D
bits instead of mapping the PTE into kernel address space. The VM_PFNMAP
path is broken as it assumes that vm_pgoff is the base pfn of the mapped
VMA range, which is conceptually wrong as vm_pgoff is the offset relative
to the file and has nothing to do with the pfn. The horrific hack worked
for the original use case (backing guest memory with /dev/mem), but leads
to accessing "random" pfns for pretty much any other VM_PFNMAP case.
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Derive the mask of RWX bits reported on EPT violations from the mask of
RWX bits that are shoved into EPT entries; the layout is the same, the
EPT violation bits are simply shifted by three. Use the new shift and a
slight copy-paste of the mask derivation instead of completely open
coding the same to convert between the EPT entry bits and the exit
qualification when synthesizing a nested EPT Violation.
No functional change intended.
Cc: SU Hang <darcy.sh@antgroup.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-3-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using self-expressing macro definition EPT_VIOLATION_GVA_VALIDATION
and EPT_VIOLATION_GVA_TRANSLATED instead of 0x180
in FNAME(walk_addr_generic)().
Signed-off-by: SU Hang <darcy.sh@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-2-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge branch for features that did not make it into 5.18:
* New ioctls to get/set TSC frequency for a whole VM
* Allow userspace to opt out of hypercall patching
Nested virtualization improvements for AMD:
* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
* Allow AVIC to co-exist with a nested guest running
* Fixes for LBR virtualizations when a nested guest is running,
and nested LBR virtualization support
* PAUSE filtering for nested hypervisors
Guest support:
* Decoupling of vcpu_is_preempted from PV spinlocks
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All work currently pending will be done first by calling destroy_workqueue,
so there is unnecessary to flush it explicitly.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220401083530.2407703-1-lv.ruyi@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.
Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.
In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded. Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.
Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.
=========================================================================
UBSAN: invalid-load in kernel/params.c:320:33
load of value 255 is not a valid value for type '_Bool'
CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
ubsan_epilogue+0x5/0x40
__ubsan_handle_load_invalid_value.cold+0x43/0x48
param_get_bool.cold+0xf/0x14
param_attr_show+0x55/0x80
module_attr_show+0x1c/0x30
sysfs_kf_seq_show+0x93/0xc0
seq_read_iter+0x11c/0x450
new_sync_read+0x11b/0x1a0
vfs_read+0xf0/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
=========================================================================
Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before Commit c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page()
to return true when remote flush is needed"), the return value
of kvm_sync_page() indicates whether the page is synced, and
kvm_mmu_get_page() would rebuild page when the sync fails.
But now, kvm_sync_page() returns false when the page is
synced and no tlb flushing is required, which leads to
rebuild page in kvm_mmu_get_page(). So return the return
value of mmu->sync_page() directly and check it in
kvm_mmu_get_page(). If the sync fails, the page will be
zapped and the invalid_list is not empty, so set flush as
true is accepted in mmu_sync_children().
Cc: stable@vger.kernel.org
Fixes: c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com>
[mmu_sync_children should not flush if the page is zapped. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it
can go through get_user_pages_fast(), but if it cannot then it tries to
use memremap(); that is not just terribly slow, it is also wrong because
it assumes that the VM_PFNMAP VMA is contiguous.
The right way to do it would be to do the same thing as
hva_to_pfn_remapped() does since commit add6a0cd1c5b ("KVM: MMU: try to
fix up page faults before giving up", 2016-07-05), using follow_pte()
and fixup_user_fault() to determine the correct address to use for
memremap(). To do this, one could for example extract hva_to_pfn()
for use outside virt/kvm/kvm_main.c. But really there is no reason to
do that either, because there is already a perfectly valid address to
do the cmpxchg() on, only it is a userspace address. That means doing
user_access_begin()/user_access_end() and writing the code in assembly
to handle exceptions correctly. Worse, the guest PTE can be 8-byte
even on i686 so there is the extra complication of using cmpxchg8b to
account for. But at least it is an efficient mess.
(Thanks to Linus for suggesting improvement on the inline assembly).
Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It makes more sense to print new SPTE value than the
old value.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two kinds of implicit supervisor access
implicit supervisor access when CPL = 3
implicit supervisor access when CPL < 3
Current permission_fault() handles only the first kind for SMAP.
But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.
So the second kind should be also supported.
The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.
But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.
The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.
This patch removes the call to ->get_cpl() for access mode is determined
by @access. Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP. The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it. The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The commit 09f037aa48f3 ("KVM: MMU: speedup update_permission_bitmask")
refactored the code of update_permission_bitmask() and change the
comments. It added a condition into a list to match the new code,
so the number/order for conditions in the comments should be updated
too.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().
The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.
And SMAP relies a factor for supervisor access: explicit or implicit.
So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.
Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
o Extra bits will be in the higher 32 bits, so that we can
easily obtain the traditional access mode (UWX) by converting
it to u32.
o Reuse the value for the access kind defined by SVM's nested
paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
@error_code in kvm_handle_page_fault().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit cf3e26427c08ad9015956293ab389004ac6a338e.
Multi-vCPU Hyper-V guests started crashing randomly on boot with the
latest kvm/queue and the problem can be bisected the problem to this
particular patch. Basically, I'm not able to boot e.g. 16-vCPU guest
successfully anymore. Both Intel and AMD seem to be affected. Reverting
the commit saves the day.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
is going to be reverted, it's not going to be true anymore that
the zap-page flow does not free any 'struct kvm_mmu_page'. Introduce
an early flush before tdp_mmu_zap_leafs() returns, to preserve
bisectability.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>