.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.
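
A minimal sketch of the documented nesting, assuming a hypothetical helper
that needs both VM-wide and per-vCPU state (the function name is made up for
illustration; the locks themselves are the real kvm->lock and vcpu->mutex)::

    /* assumes #include <linux/kvm_host.h> */
    static int example_update_vm_and_vcpu(struct kvm *kvm,
                                          struct kvm_vcpu *vcpu)
    {
            int ret = 0;

            mutex_lock(&kvm->lock);         /* outer lock is taken first */
            mutex_lock(&vcpu->mutex);       /* per-vCPU lock nests inside */

            /* ... touch VM-wide and per-vCPU state here ... */

            mutex_unlock(&vcpu->mutex);     /* release in reverse order */
            mutex_unlock(&kvm->lock);
            return ret;
    }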

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protection. That means we just need to change the W bit of
   the spte.

To avoid these races, we use the SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE
bits on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable in the mmu. The bit is set
  when the gfn is writable in the guest mmu and is not write-protected by
  shadow page write-protection.

On the fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1, or
restore the saved R/X bits if the spte is marked for access tracking
(SPTE_SPECIAL_MASK is set), or both. This is safe because any change to these
bits is detected by the cmpxchg.
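
A minimal sketch of the lockless update described above, with a hypothetical
wrapper name (cmpxchg64() and PT_WRITABLE_MASK are the real kernel/x86 KVM
symbols; the actual logic lives in fast_page_fault())::

    static bool example_fast_set_writable(u64 *sptep, u64 old_spte)
    {
            u64 new_spte = old_spte | PT_WRITABLE_MASK;

            /*
             * If anything in the spte changed since it was read (pfn,
             * access bits, write protection), the cmpxchg fails and the
             * fault falls back to the slow path under mmu-lock.
             */
            return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
    }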

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure that the pfn
is not changed during the cmpxchg. This is an ABA problem; for example, the
following case can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|   gpte = gfn1                                                           |
|   gfn1 is mapped to pfn1 on host                                        |
|   spte is the shadow page table entry corresponding with gpte and      |
|   spte = pfn1                                                           |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |   spte = 0;                       |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |   spte = pfn1;                    |
+------------------------------------+-----------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W))                              |
|      mark_page_dirty(vcpu->kvm, gfn1)                                   |
|         OOPS!!!                                                         |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means gfn2 is lost from the dirty bitmap.

For a direct sp, we can easily avoid this since the spte of a direct sp is
fixed to the gfn. For an indirect sp, before we do the cmpxchg, we call
gfn_to_pfn_atomic() to pin the gfn to the pfn, because after
gfn_to_pfn_atomic():

- We have held the refcount of the pfn, which means the pfn can not be freed
  and reused for another gfn.
- The pfn is writable, which means it can not be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for the gfn.

Currently, to simplify things, we disable fast page fault for indirect
shadow pages.
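
A minimal sketch of the pin-and-check pattern described above, with a
hypothetical wrapper name and illustrative checks (gfn_to_pfn_atomic(),
is_error_pfn(), kvm_release_pfn_clean() and mark_page_dirty() are real KVM
helpers; spte_to_pfn() is the x86 MMU helper that extracts the pfn from an
spte)::

    static bool example_fix_indirect_spte(struct kvm_vcpu *vcpu, gfn_t gfn,
                                          u64 *sptep, u64 old_spte)
    {
            kvm_pfn_t pfn;
            bool fixed = false;

            /* Pin the pfn so it cannot be freed and reused for another gfn. */
            pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
            if (is_error_pfn(pfn))
                    return false;

            /* Only mark gfn dirty if the spte still maps the pinned pfn. */
            if (spte_to_pfn(old_spte) == pfn &&
                cmpxchg64(sptep, old_spte,
                          old_spte | PT_WRITABLE_MASK) == old_spte) {
                    mark_page_dirty(vcpu->kvm, gfn);
                    fixed = true;
            }

            kvm_release_pfn_clean(pfn);
            return fixed;
    }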

2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit can not be lost.

But it is not true after fast page fault, since the spte can be marked
writable between reading the spte and updating the spte, as in the case
below:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|   spte.W = 0                                                            |
|   spte.Accessed = 1                                                     |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|      old_spte.W == 0)              |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |   spte.W = 1                      |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |   spte.Dirty = 1                  |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); that
means the spte is always atomically updated in this case.
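
A minimal sketch of that rule, with a hypothetical wrapper name
(spte_has_volatile_bits() is the real x86 MMU helper; xchg() and WRITE_ONCE()
are standard kernel primitives)::

    static u64 example_clear_spte(u64 *sptep, u64 old_spte)
    {
            /*
             * If the spte can change under us (e.g. be made writable by the
             * fast page fault path out of mmu-lock), a plain store could
             * lose a concurrently set Accessed/Dirty bit, so use an atomic
             * exchange and return the value that was actually present.
             */
            if (spte_has_volatile_bits(old_spte))
                    return xchg(sptep, 0ull);

            WRITE_ONCE(*sptep, 0ull);
            return old_spte;
    }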

3) Flush TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is a common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, and the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().
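
A minimal sketch of that check, with a hypothetical helper name
(is_writable_pte() is the real x86 MMU helper; the actual check is performed
inside mmu_spte_update())::

    static bool example_spte_update_needs_flush(u64 old_spte, u64 new_spte)
    {
            /*
             * If the old spte was writable and the new one is not, a stale
             * writable translation may still be cached in some TLB, so the
             * caller must flush before relying on write protection.
             */
            return is_writable_pte(old_spte) && !is_writable_pte(new_spte);
    }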

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
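
A minimal sketch of the mark/restore idea, where the bit positions and mask
names below are assumptions made up for illustration (the real constants are
the shadow_acc_track_* values in the x86 MMU)::

    #define EXAMPLE_EPT_RWX_MASK           0x7ull /* R=bit 0, W=bit 1, X=bit 2 */
    #define EXAMPLE_EPT_RX_MASK            0x5ull /* only R and X are saved */
    #define EXAMPLE_ACC_TRACK_SAVED_SHIFT  52     /* assumed ignored-bit range */

    static u64 example_mark_spte_for_access_track(u64 spte)
    {
            /* Stash the original R/X bits in ignored bits, clear all of RWX. */
            spte |= (spte & EXAMPLE_EPT_RX_MASK) << EXAMPLE_ACC_TRACK_SAVED_SHIFT;
            spte &= ~EXAMPLE_EPT_RWX_MASK;
            return spte;    /* SPTE_SPECIAL_MASK would also be set on the spte */
    }

    static u64 example_restore_acc_track_spte(u64 spte, bool write_fault)
    {
            /* Put the saved R/X bits back; W is granted only on a write access. */
            spte |= (spte >> EXAMPLE_ACC_TRACK_SAVED_SHIFT) & EXAMPLE_EPT_RX_MASK;
            spte &= ~(EXAMPLE_EPT_RX_MASK << EXAMPLE_ACC_TRACK_SAVED_SHIFT);
            if (write_fault)
                    spte |= 0x2ull;         /* the W bit */
            return spte;
    }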

3. Reference
------------

:Name: kvm_lock
:Type: mutex
:Arch: any
:Protects: - vm_list

:Name: kvm_count_lock
:Type: raw_spinlock_t
:Arch: any
:Protects: - hardware virtualization enable/disable
:Comment: 'raw' because hardware enabling/disabling must be atomic with
          respect to migration.

:Name: kvm_arch::tsc_write_lock
:Type: raw_spinlock
:Arch: x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment: 'raw' because updating the tsc offsets must not be preempted.

:Name: kvm->mmu_lock
:Type: spinlock_t
:Arch: any
:Protects: - shadow page/shadow tlb entry
:Comment: It is a spinlock since it is used in the MMU notifier.

:Name: kvm->srcu
:Type: srcu lock
:Arch: any
:Protects: - kvm->memslots
           - kvm->buses
:Comment: The srcu read lock must be held while accessing memslots (e.g.
          when using gfn_to_* functions) and while accessing in-kernel
          MMIO/PIO address->device structure mapping (kvm->buses).
          The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
          if it is needed by multiple functions.

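A minimal sketch of the usage pattern described in the comment
(srcu_read_lock()/srcu_read_unlock() and gfn_to_memslot() are real APIs; the
function name is made up for illustration)::

    static void example_lookup_memslot(struct kvm *kvm, gfn_t gfn)
    {
            struct kvm_memory_slot *slot;
            int idx;

            idx = srcu_read_lock(&kvm->srcu);
            /* Memslot lookups are only safe inside the SRCU read section. */
            slot = gfn_to_memslot(kvm, gfn);
            /* ... use the slot ... */
            (void)slot;
            srcu_read_unlock(&kvm->srcu, idx);
    }
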
:Name: blocked_vcpu_on_cpu_lock
:Type: spinlock_t
:Arch: x86
:Protects: blocked_vcpu_on_cpu
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
          When VT-d posted-interrupts are supported and the VM has assigned
          devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
          protected by blocked_vcpu_on_cpu_lock. When the VT-d hardware
          issues a wakeup notification event because external interrupts
          from the assigned devices arrive, we find the vCPU on the list
          and wake it up.