// SPDX-License-Identifier: GPL-2.0
#include "mmu_internal.h"
#include "tdp_iter.h"
#include "spte.h"

/*
 * Recalculates the pointer to the SPTE for the current GFN and level and
 * rereads the SPTE.
 */
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
	iter->sptep = iter->pt_path[iter->level - 1] +
		SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}

static gfn_t round_gfn_for_level(gfn_t gfn, int level)
{
	return gfn & -KVM_PAGES_PER_HPAGE(level);
}

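/*
 * Illustrative example (not part of the original file): at the 2MiB mapping
 * level on x86, KVM_PAGES_PER_HPAGE(level) is 512, so the AND with its
 * negation clears the low 9 bits of the GFN, e.g. a GFN of 0x12345 rounds
 * down to 0x12200, the first GFN covered by that 2MiB entry.
 */
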
/*
 * Return the TDP iterator to the root PT and allow it to continue its
 * traversal over the paging structure from there.
 */
void tdp_iter_restart(struct tdp_iter *iter)
{
	iter->yielded = false;
	iter->yielded_gfn = iter->next_last_level_gfn;
	iter->level = iter->root_level;

	iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
	tdp_iter_refresh_sptep(iter);

	iter->valid = true;
}

/*
 * Sets a TDP iterator to walk a pre-order traversal of the paging structure
 * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
 */
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
		    int min_level, gfn_t next_last_level_gfn)
{
	int root_level = root->role.level;

	WARN_ON(root_level < 1);
	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);

	iter->next_last_level_gfn = next_last_level_gfn;
	iter->root_level = root_level;
	iter->min_level = min_level;
	iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
	iter->as_id = kvm_mmu_page_as_id(root);

	tdp_iter_restart(iter);
}

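/*
 * Illustrative usage sketch (an assumption, not code from this file):
 * callers typically drive the iterator with a loop of roughly this shape,
 * where "root", "start" and "end" are hypothetical caller-provided values
 * and PG_LEVEL_4K is used as the minimum level for a full walk:
 *
 *	struct tdp_iter iter;
 *
 *	for (tdp_iter_start(&iter, root, PG_LEVEL_4K, start);
 *	     iter.valid && iter.gfn < end;
 *	     tdp_iter_next(&iter)) {
 *		inspect or modify iter.old_spte / iter.sptep here
 *	}
 */
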
/*
 * Given an SPTE and its level, returns a pointer containing the host virtual
 * address of the child page table referenced by the SPTE. Returns null if
 * there is no such entry.
 */
tdp_ptep_t spte_to_child_pt(u64 spte, int level)
{
	/*
	 * There's no child entry if this entry isn't present or is a
	 * last-level entry.
	 */
	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
		return NULL;

	return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
}

/*
 * Steps down one level in the paging structure towards the goal GFN. Returns
 * true if the iterator was able to step down a level, false otherwise.
 */
static bool try_step_down(struct tdp_iter *iter)
{
	tdp_ptep_t child_pt;

	if (iter->level == iter->min_level)
		return false;

	/*
	 * Reread the SPTE before stepping down to avoid traversing into page
	 * tables that are no longer linked from this entry.
	 */
	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
	if (!child_pt)
		return false;

	iter->level--;
	iter->pt_path[iter->level - 1] = child_pt;
	iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
	tdp_iter_refresh_sptep(iter);

	return true;
}

/*
 * Steps to the next entry in the current page table, at the current page table
 * level. The next entry could point to a page backing guest memory or another
 * page table, or it could be non-present. Returns true if the iterator was
 * able to step to the next entry in the page table, false if the iterator was
 * already at the end of the current page table.
 */
static bool try_step_side(struct tdp_iter *iter)
{
	/*
	 * Check if the iterator is already at the end of the current page
	 * table.
	 */
	if (SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
	    (SPTE_ENT_PER_PAGE - 1))
		return false;

	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
	iter->next_last_level_gfn = iter->gfn;
	iter->sptep++;
	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

	return true;
}

/*
 * Tries to traverse back up a level in the paging structure so that the walk
 * can continue from the next entry in the parent page table. Returns true on a
 * successful step up, false if already in the root page.
 */
static bool try_step_up(struct tdp_iter *iter)
{
	if (iter->level == iter->root_level)
		return false;

	iter->level++;
	iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
	tdp_iter_refresh_sptep(iter);

	return true;
}

/*
 * Step to the next SPTE in a pre-order traversal of the paging structure.
 * To get to the next SPTE, the iterator either steps down towards the goal
 * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
 * higher GFN.
 *
 * The basic algorithm is as follows:
 * 1. If the current SPTE is a non-last-level SPTE, step down into the page
 *    table it points to.
 * 2. If the iterator cannot step down, it will try to step to the next SPTE
 *    in the current page of the paging structure.
 * 3. If the iterator cannot step to the next entry in the current page, it
 *    will try to step up to the parent paging structure page. In this case,
 *    that SPTE will have already been visited, and so the iterator must also
 *    step to the side again.
 *
 * A short illustrative walk is sketched in the comment after the function.
 */
void tdp_iter_next(struct tdp_iter *iter)
{
	if (iter->yielded) {
		tdp_iter_restart(iter);
		return;
	}

	if (try_step_down(iter))
		return;

	do {
		if (try_step_side(iter))
			return;
	} while (try_step_up(iter));

	iter->valid = false;
}
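
/*
 * Illustrative walk (an example, not code from this file): with a 4-level
 * root and min_level == PG_LEVEL_4K, starting the iterator at GFN 0 visits
 * the level-4 entry for GFN 0, steps down through whichever level-3, level-2
 * and level-1 entries are present, then steps sideways across the level-1
 * entries. Once a table's entries are exhausted, the walk steps back up and
 * continues from the parent's next entry, yielding a pre-order traversal of
 * the paging structure.
 */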