Commit 75e7fcd

mchehab authored and bonzini committed
docs: kvm: Convert locking.txt to ReST format
- Use document title and chapter markups;
- Add markups for literal blocks;
- use :field: for field descriptions;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
1 parent 5a0af48 commit 75e7fcd

3 files changed: +244 -215 lines changed

Documentation/virt/kvm/index.rst

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ KVM
    cpuid
    halt-polling
    hypercalls
+   locking
    msr
    vcpu-requests

Documentation/virt/kvm/locking.rst

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.

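The ordering rules above can be illustrated with a small user-space sketch.
It is only an analogy: ``struct kvm_sketch``, its fields and
``kvm_sketch_update()`` are hypothetical stand-ins built on POSIX mutexes,
not kernel code. The point is that the outermost lock is always taken first
and the locks are released in reverse order.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the mutexes named above. */
struct kvm_sketch {
	pthread_mutex_t lock;       /* plays the role of kvm->lock */
	pthread_mutex_t slots_lock; /* plays the role of kvm->slots_lock */
	pthread_mutex_t irq_lock;   /* plays the role of kvm->irq_lock */
	int generation;             /* dummy state guarded by all three */
};

void kvm_sketch_init(struct kvm_sketch *kvm)
{
	assert(kvm != NULL);
	pthread_mutex_init(&kvm->lock, NULL);
	pthread_mutex_init(&kvm->slots_lock, NULL);
	pthread_mutex_init(&kvm->irq_lock, NULL);
	kvm->generation = 0;
}

/*
 * Take the locks in the documented order (lock, then slots_lock, then
 * irq_lock) and release them in reverse. Returns the new generation.
 */
int kvm_sketch_update(struct kvm_sketch *kvm)
{
	int gen;

	assert(kvm != NULL);
	pthread_mutex_lock(&kvm->lock);
	pthread_mutex_lock(&kvm->slots_lock);
	pthread_mutex_lock(&kvm->irq_lock);

	gen = ++kvm->generation;

	pthread_mutex_unlock(&kvm->irq_lock);
	pthread_mutex_unlock(&kvm->slots_lock);
	pthread_mutex_unlock(&kvm->lock);
	return gen;
}
```

If every path that needs more than one of these locks follows the same
order, two threads can never each hold one lock while waiting for the other.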
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault outside of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of
   the spte.

What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and the
SPTE_MMU_WRITEABLE bit on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable on the mmu. The bit is set when
  the gfn is writable on the guest mmu and it is not write-protected by shadow
  page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if the VMX_EPT_TRACK_ACCESS mask is set, or both.
This is safe because any concurrent change to these bits is detected by
cmpxchg.

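The compare-and-exchange step can be sketched in portable C11. This is a
minimal illustration, not the kernel's code: the ``SKETCH_*`` bit positions
and the function name are assumptions, and the eligibility check is assumed
to test the two software bits defined in the list above (host-writable and
MMU-writable).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions are illustrative assumptions, not the kernel's values. */
#define SKETCH_PT_WRITABLE    (UINT64_C(1) << 1)
#define SKETCH_HOST_WRITEABLE (UINT64_C(1) << 9)
#define SKETCH_MMU_WRITEABLE  (UINT64_C(1) << 10)

/*
 * Try to set the W bit without holding any lock. Returns true only if the
 * entry still held the value that was inspected when the W bit was set.
 */
bool sketch_fast_fix_write(_Atomic uint64_t *sptep)
{
	const uint64_t need = SKETCH_HOST_WRITEABLE | SKETCH_MMU_WRITEABLE;
	uint64_t old;

	assert(sptep != NULL);
	old = atomic_load(sptep);

	if ((old & need) != need)
		return false; /* not eligible: the slow path must handle it */

	/* Fails if another CPU changed any bit since the load above. */
	return atomic_compare_exchange_strong(sptep, &old,
					      old | SKETCH_PT_WRITABLE);
}
```

A failed exchange leaves the entry untouched, so the caller can simply retry
or fall back to the path that takes the lock.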
But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg. This is an ABA problem; for example, the case
below can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|     gpte = gfn1                                                        |
|     gfn1 is mapped to pfn1 on host                                     |
|     spte is the shadow page table entry corresponding with gpte and    |
|     spte = pfn1                                                        |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is re-allocated for gfn2.    |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|       mark_page_dirty(vcpu->kvm, gfn1)                                 |
|            OOPS!!!                                                     |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means gfn2 is lost in the dirty bitmap.

For a direct sp, we can easily avoid it since the spte of a direct sp is fixed
to the gfn. For an indirect sp, before we do cmpxchg, we call
gfn_to_pfn_atomic() to pin gfn to pfn, because after gfn_to_pfn_atomic():

- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable, which means it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

Currently, to simplify the whole thing, we disable fast page fault for
indirect shadow pages.

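The reason cmpxchg alone cannot catch the gfn-to-pfn race can be replayed on
a single thread. The sketch below is an assumption-laden illustration:
``SKETCH_PFN1`` and ``SKETCH_W`` are made-up encodings, and the two "CPUs"
are simply consecutive statements in the order of the interleaving described
for this case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_W    (UINT64_C(1) << 1) /* assumed position of the W bit */
#define SKETCH_PFN1 UINT64_C(0x1000)   /* hypothetical encoding of pfn1 */

/*
 * Returns true if the final compare-and-exchange succeeds even though the
 * entry was zapped and reinstalled in between, i.e. the ABA case went
 * undetected.
 */
bool sketch_aba_goes_undetected(void)
{
	_Atomic uint64_t spte = SKETCH_PFN1;    /* gfn1 -> pfn1 */
	uint64_t old_spte = atomic_load(&spte); /* CPU 0 reads the entry */

	atomic_store(&spte, 0);           /* CPU 1: pfn1 is swapped out */
	atomic_store(&spte, SKETCH_PFN1); /* CPU 1: pfn1 reused, now for gfn2 */

	/* Same bits as before, but they now describe a different gfn. */
	assert(atomic_load(&spte) == old_spte);

	/* CPU 0: the comparison sees identical bits and cannot tell. */
	return atomic_compare_exchange_strong(&spte, &old_spte,
					      old_spte | SKETCH_W);
}
```

The exchange succeeds because only the bit pattern is compared, which is why
the text relies on a fixed gfn (direct sp) or a pinned pfn instead.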
2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the case below:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|     spte.W = 0                                                         |
|     spte.Accessed = 1                                                  |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|      old_spte.W == 0)              |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|    old_spte = xchg(spte, 0ull)     |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|    kvm_set_pfn_accessed(spte.pfn); |                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|    kvm_set_pfn_dirty(spte.pfn);    |                                   |
|    OOPS!!!                         |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this
means the spte is always atomically updated in this case.

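The difference between the non-atomic and the atomic clear can be shown with
a single-threaded replay. Everything here is a hedged sketch: the
``SKETCH_*`` bit positions and function names are assumptions, and the
concurrent CPU is simulated by a helper that is called inside the race
window.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions are illustrative assumptions. */
#define SKETCH_W        (UINT64_C(1) << 1)
#define SKETCH_ACCESSED (UINT64_C(1) << 5)
#define SKETCH_DIRTY    (UINT64_C(1) << 6)

/* Simulates the other CPU: fast page fault sets W, a guest write sets Dirty. */
static void sketch_other_cpu(_Atomic uint64_t *sptep)
{
	atomic_fetch_or(sptep, SKETCH_W | SKETCH_DIRTY);
}

/* Non-atomic clear: acts on a stale snapshot. Returns the Dirty bit it saw. */
bool sketch_clear_plain(_Atomic uint64_t *sptep)
{
	uint64_t old;

	assert(sptep != NULL);
	old = atomic_load(sptep);  /* snapshot: W = 0, Accessed = 1 */
	sketch_other_cpu(sptep);   /* race window */
	atomic_store(sptep, 0);    /* plain overwrite discards the new bits */
	return (old & SKETCH_DIRTY) != 0;
}

/* Atomic clear: the exchange returns whatever was present when it cleared. */
bool sketch_clear_xchg(_Atomic uint64_t *sptep)
{
	uint64_t old;

	assert(sptep != NULL);
	sketch_other_cpu(sptep);   /* same race window */
	old = atomic_exchange(sptep, 0);
	return (old & SKETCH_DIRTY) != 0;
}
```

Only the exchange-based variant reports the Dirty bit that was set during the
window, which is the behaviour the "volatile" treatment is meant to
guarantee.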
3) Flush TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make the path easy to audit, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is the common function for updating an spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, so the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

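The flush decision can be reduced to a comparison of the old and new values,
provided the old value is obtained atomically. The sketch below only captures
that core idea under stated assumptions: ``SKETCH_W`` and
``sketch_spte_update()`` are hypothetical, and the real mmu_spte_update()
handles more than this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_W (UINT64_C(1) << 1) /* assumed position of the W bit */

/*
 * Atomically install new_spte and report whether TLBs must be flushed,
 * i.e. whether the entry went from writable to read-only. Because the old
 * value comes from the exchange itself, a W bit set concurrently by the
 * fast page fault path is not missed.
 */
bool sketch_spte_update(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	uint64_t old;

	assert(sptep != NULL);
	old = atomic_exchange(sptep, new_spte);
	return (old & SKETCH_W) != 0 && (new_spte & SKETCH_W) == 0;
}
```

A caller would issue the actual TLB flush whenever this returns true.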
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.

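The bit manipulation behind access tracking can be sketched as two pure
functions. This is an illustration under assumptions: the R/W/X positions
follow the usual EPT layout (bits 0 to 2), bit 62 comes from the text, and
``SK_SAVED_SHIFT`` is a guess at where the ignored bits live. The function
names are hypothetical, and the sketch leaves out the eligibility checks that
the fast page fault path applies before making an entry writable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_R           (UINT64_C(1) << 0)
#define SK_W           (UINT64_C(1) << 1)
#define SK_X           (UINT64_C(1) << 2)
#define SK_RWX         (SK_R | SK_W | SK_X)
#define SK_SPECIAL     (UINT64_C(1) << 62) /* ignored bit 62, per the text */
#define SK_SAVED_SHIFT 52                  /* assumed home of the saved bits */
#define SK_SAVED_MASK  ((SK_R | SK_X) << SK_SAVED_SHIFT)

/* Make the entry not-present, remembering R and X but not W. */
uint64_t sk_mark_for_access_track(uint64_t spte)
{
	uint64_t saved = (spte & (SK_R | SK_X)) << SK_SAVED_SHIFT;

	assert((spte & SK_SPECIAL) == 0); /* must not already be marked */
	return (spte & ~SK_RWX) | saved | SK_SPECIAL;
}

/* Compute the restored entry; W is set only for a write access. */
uint64_t sk_restore(uint64_t spte, bool write_access)
{
	uint64_t rx = (spte & SK_SAVED_MASK) >> SK_SAVED_SHIFT;

	assert((spte & SK_SPECIAL) != 0); /* must be marked for tracking */
	spte &= ~(SK_SAVED_MASK | SK_SPECIAL);
	spte |= rx;
	if (write_access)
		spte |= SK_W;
	return spte;
}
```

The value returned by ``sk_restore()`` is what would then be installed with
a compare-and-exchange against the marked entry, so that a concurrent change
is detected.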
3. Reference
------------

:Name:      kvm_lock
:Type:      mutex
:Arch:      any
:Protects:  - vm_list

:Name:      kvm_count_lock
:Type:      raw_spinlock_t
:Arch:      any
:Protects:  - hardware virtualization enable/disable
:Comment:   'raw' because hardware enabling/disabling must be atomic with
            respect to migration.

:Name:      kvm_arch::tsc_write_lock
:Type:      raw_spinlock
:Arch:      x86
:Protects:  - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
            - tsc offset in vmcb
:Comment:   'raw' because updating the tsc offsets must not be preempted.

:Name:      kvm->mmu_lock
:Type:      spinlock_t
:Arch:      any
:Protects:  - shadow page/shadow tlb entry
:Comment:   it is a spinlock since it is used in mmu notifier.

:Name:      kvm->srcu
:Type:      srcu lock
:Arch:      any
:Protects:  - kvm->memslots
            - kvm->buses
:Comment:   The srcu read lock must be held while accessing memslots (e.g.
            when using gfn_to_* functions) and while accessing in-kernel
            MMIO/PIO address->device structure mapping (kvm->buses).
            The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
            if it is needed by multiple functions.

:Name:      blocked_vcpu_on_cpu_lock
:Type:      spinlock_t
:Arch:      x86
:Protects:  blocked_vcpu_on_cpu
:Comment:   This is a per-CPU lock and it is used for VT-d posted-interrupts.
            When VT-d posted-interrupts is supported and the VM has assigned
            devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
            protected by blocked_vcpu_on_cpu_lock. When the VT-d hardware
            issues a wakeup notification event because external interrupts
            from the assigned devices happen, we find the vCPU on the list
            to wake up.
