2f303b74a6
In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock.  However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held.  This leads to the following:

BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]

Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
 [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
 [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
 [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
 [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
 [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
 [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
 [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
 [<ffffffff811185bf>] kswapd+0x18f/0x490
 [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
 [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
 [<ffffffff81060d2b>] kthread+0xdb/0xe0
 [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
 [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
 [<ffffffff81060c50>] ? __init_kthread_worker+0x

After the previous patch, kvm_lock need not be a raw spinlock anymore,
so change it back.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM Lock Overview
=================

1. Acquisition Orders
---------------------

(to be written)

2. Exception
------------

Fast page fault:

Fast page fault is the fast path that fixes a guest page fault outside the
mmu-lock on x86. Currently, a page fault can be handled on the fast path
only if the shadow page table is present and the fault is caused by write
protection; in that case we only need to change the W bit of the spte.

To avoid all the races, we rely on the SPTE_HOST_WRITEABLE bit and the
SPTE_MMU_WRITEABLE bit of the spte:
- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable on the mmu. The bit is set
  when the gfn is writable on the guest mmu and it is not write-protected
  by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1. This
is safe because any change to these bits is detected by cmpxchg.
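
The check-then-cmpxchg step described above can be sketched in user-space C
with C11 atomics. This is only an illustration, not the kernel code: the bit
positions and the helper names spte_can_fast_fix() and
fast_fix_write_protect() are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for this sketch only. */
#define SPTE_W              (1ull << 1)
#define SPTE_HOST_WRITEABLE (1ull << 10)
#define SPTE_MMU_WRITEABLE  (1ull << 11)

/* True if both software-writable bits are set in the given spte value. */
bool spte_can_fast_fix(uint64_t spte)
{
    const uint64_t mask = SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE;

    return (spte & mask) == mask;
}

/*
 * Try to set the W bit without taking any lock. Returns true only if
 * this call installed the W bit; returns false if the spte is not
 * eligible or if it changed between the read and the cmpxchg.
 */
bool fast_fix_write_protect(_Atomic uint64_t *sptep)
{
    assert(sptep != NULL);

    uint64_t old_spte = atomic_load(sptep);

    if (!spte_can_fast_fix(old_spte))
        return false;

    /* Fails if any bit of the spte, including the two above, changed. */
    return atomic_compare_exchange_strong(sptep, &old_spte,
                                          old_spte | SPTE_W);
}
```

A caller that gets false back would simply fall back to the slow path under
mmu-lock.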

But we need to check the following cases carefully:
1): The mapping from gfn to pfn
The mapping from gfn to pfn may be changed, since we can only ensure that
the pfn is not changed during cmpxchg. This is an ABA problem; for example,
the following case can happen:

At the beginning:
gpte = gfn1
gfn1 is mapped to pfn1 on host
spte is the shadow page table entry corresponding to gpte and
spte = pfn1

   VCPU 0                           VCPU 1
on fast page fault path:

   old_spte = *spte;
                                 pfn1 is swapped out:
                                    spte = 0;

                                 pfn1 is reallocated for gfn2.

                                 gpte is changed to point to
                                 gfn2 by the guest:
                                    spte = pfn1;

   if (cmpxchg(spte, old_spte, old_spte+W))
        mark_page_dirty(vcpu->kvm, gfn1)
             OOPS!!!

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.
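
The interleaving above can be replayed deterministically on a single thread
to show why cmpxchg alone cannot detect it. This is a minimal sketch with
C11 atomics; the constants SPTE_W and PFN1 and the function aba_replay() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W (1ull << 1)        /* hypothetical writable bit */
#define PFN1   (0x1234ull << 12)  /* hypothetical frame address of pfn1 */

/*
 * Replays the two-VCPU interleaving step by step. Returns true if the
 * stale cmpxchg still succeeds, i.e. the ABA problem is reproduced.
 */
bool aba_replay(void)
{
    _Atomic uint64_t spte = PFN1;   /* spte = pfn1, currently maps gfn1 */

    /* VCPU 0: the fast page fault path snapshots the spte. */
    uint64_t old_spte = atomic_load(&spte);

    /* VCPU 1: pfn1 is swapped out. */
    atomic_store(&spte, 0);

    /* VCPU 1: pfn1 is reused for gfn2 and gpte now points to gfn2. */
    atomic_store(&spte, PFN1);

    /* Same bits as the snapshot, but they now describe gfn2. */
    assert(atomic_load(&spte) == old_spte);

    /* VCPU 0: the compare sees an unchanged value and sets W. */
    return atomic_compare_exchange_strong(&spte, &old_spte,
                                          old_spte | SPTE_W);
}
```

Because the spte holds only the pfn, the value-based comparison has no way
to tell that the gfn behind it changed, so VCPU 0 would go on to mark gfn1
dirty.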

For a direct sp, we can easily avoid this since the spte of a direct sp is
fixed to its gfn. For an indirect sp, before we do cmpxchg, we call
gfn_to_pfn_atomic() to pin the gfn to the pfn, because after
gfn_to_pfn_atomic():
- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable, which means it cannot be shared between different
  gfns by KSM.

Then we can ensure the dirty bitmap is correctly set for a gfn.

Currently, to keep things simple, we disable fast page fault for indirect
shadow pages.

2): Dirty bit tracking
In the original code, the spte can be updated quickly (non-atomically) if
the spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But this is no longer true with fast page fault, since the spte can be
marked writable between reading the spte and updating it, as in the
following case:

At the beginning:
spte.W = 0
spte.Accessed = 1

   VCPU 0                                       VCPU 1
In mmu_spte_clear_track_bits():

   old_spte = *spte;

   /* 'if' condition is satisfied. */
   if (old_spte.Accessed == 1 &&
        old_spte.W == 0)
      spte = 0ull;
                                         on fast page fault path:
                                             spte.W = 1
                                         memory write on the spte:
                                             spte.Dirty = 1


   else
      old_spte = xchg(spte, 0ull)


   if (old_spte.Accessed == 1)
      kvm_set_pfn_accessed(spte.pfn);
   if (old_spte.Dirty == 1)
      kvm_set_pfn_dirty(spte.pfn);
      OOPS!!!

The Dirty bit is lost in this case.

To avoid this kind of issue, we always treat the spte as "volatile" if it
can be updated outside the mmu-lock (see spte_has_volatile_bits()); that is,
the spte is always updated atomically in this case.
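
The "volatile" rule can be sketched as follows in user-space C with C11
atomics. It is a simplified illustration, not the kernel implementation: the
bit positions, spte_is_volatile() and spte_clear_track_bits() are
assumptions for this sketch, and the real spte_has_volatile_bits() may take
more conditions into account.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for this sketch only. */
#define SPTE_W              (1ull << 1)
#define SPTE_ACCESSED       (1ull << 5)
#define SPTE_DIRTY          (1ull << 6)
#define SPTE_HOST_WRITEABLE (1ull << 10)
#define SPTE_MMU_WRITEABLE  (1ull << 11)

/* An spte that the fast path may modify without mmu-lock is "volatile". */
bool spte_is_volatile(uint64_t spte)
{
    const uint64_t mask = SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE;

    return (spte & mask) == mask;
}

/*
 * Clear the spte and return the value it held at the moment it was
 * cleared, so the caller can propagate the Accessed and Dirty bits.
 */
uint64_t spte_clear_track_bits(_Atomic uint64_t *sptep)
{
    assert(sptep != NULL);

    uint64_t old_spte = atomic_load(sptep);

    if (!spte_is_volatile(old_spte)) {
        /* Nobody else can change it: a plain store loses nothing. */
        atomic_store(sptep, 0);
        return old_spte;
    }

    /* xchg returns the final value, including a late W or Dirty bit. */
    return atomic_exchange(sptep, 0);
}
```

The point of the design is that the decision is based on whether the spte
could change concurrently, not on the W and Accessed bits seen in a stale
snapshot.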

3): Flush TLBs due to spte update
If the spte is updated from writable to read-only, we should flush all TLBs;
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might still be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable outside the
mmu-lock on the fast page fault path. To make this path easy to audit, we
check in mmu_spte_update() whether TLBs need to be flushed for this reason,
since it is the common function for updating an spte (present -> present).

Since the spte is "volatile" if it can be updated outside the mmu-lock, we
always update the spte atomically, so the race caused by fast page fault is
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().
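
A minimal sketch of the flush decision, assuming the update is done with an
atomic exchange so that the old value reflects any W bit set concurrently by
the fast path. The name spte_update() and the bit position are hypothetical;
the check in the kernel's mmu_spte_update() may differ or be more
conservative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPTE_W (1ull << 1)  /* hypothetical writable bit */

/*
 * Replace a present spte with another present spte. Returns true if
 * the caller must flush TLBs, i.e. the spte went from writable to
 * read-only and a writable translation may still be cached.
 */
bool spte_update(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    assert(sptep != NULL);

    /* The exchange returns the value that was really replaced. */
    uint64_t old_spte = atomic_exchange(sptep, new_spte);

    return (old_spte & SPTE_W) && !(new_spte & SPTE_W);
}
```

Deciding from the exchanged value rather than from an earlier read is what
keeps a W bit set by the lockless path from being missed.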

3. Reference
------------

Name:           kvm_lock
Type:           spinlock_t
Arch:           any
Protects:       - vm_list

Name:           kvm_count_lock
Type:           raw_spinlock_t
Arch:           any
Protects:       - hardware virtualization enable/disable
Comment:        'raw' because hardware enabling/disabling must be atomic with
                respect to migration.

Name:           kvm_arch::tsc_write_lock
Type:           raw_spinlock_t
Arch:           x86
Protects:       - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
                - tsc offset in vmcb
Comment:        'raw' because updating the tsc offsets must not be preempted.

Name:           kvm->mmu_lock
Type:           spinlock_t
Arch:           any
Protects:       - shadow page/shadow tlb entry
Comment:        it is a spinlock since it is used in mmu notifier.

Name:           kvm->srcu
Type:           srcu lock
Arch:           any
Protects:       - kvm->memslots
                - kvm->buses
Comment:        The srcu read lock must be held while accessing memslots (e.g.
                when using gfn_to_* functions) and while accessing in-kernel
                MMIO/PIO address->device structure mapping (kvm->buses).
                The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
                if it is needed by multiple functions.