+ .. SPDX-License-Identifier: GPL-2.0
+
+ ======================
  The x86 kvm shadow mmu
  ======================
@@ -7,27 +10,37 @@ physical addresses to host physical addresses.
  The mmu code attempts to satisfy the following requirements:

- - correctness: the guest should not be able to determine that it is running
+ - correctness:
+     the guest should not be able to determine that it is running
      on an emulated mmu except for timing (we attempt to comply
      with the specification, not emulate the characteristics of
      a particular implementation such as tlb size)
- - security: the guest must not be able to touch host memory not assigned
+ - security:
+     the guest must not be able to touch host memory not assigned
      to it
- - performance: minimize the performance penalty imposed by the mmu
- - scaling: need to scale to large memory and large vcpu guests
- - hardware: support the full range of x86 virtualization hardware
- - integration: Linux memory management code must be in control of guest memory
+ - performance:
+     minimize the performance penalty imposed by the mmu
+ - scaling:
+     need to scale to large memory and large vcpu guests
+ - hardware:
+     support the full range of x86 virtualization hardware
+ - integration:
+     Linux memory management code must be in control of guest memory
      so that swapping, page migration, page merging, transparent
      hugepages, and similar features work without change
- - dirty tracking: report writes to guest memory to enable live migration
+ - dirty tracking:
+     report writes to guest memory to enable live migration
      and framebuffer-based displays
- - footprint: keep the amount of pinned kernel memory low (most memory
+ - footprint:
+     keep the amount of pinned kernel memory low (most memory
      should be shrinkable)
- - reliability: avoid multipage or GFP_ATOMIC allocations
+ - reliability:
+     avoid multipage or GFP_ATOMIC allocations

  Acronyms
  ========

+ ==== ====================================================================
  pfn  host page frame number
  hpa  host physical address
  hva  host virtual address
@@ -41,6 +54,7 @@ pte page table entry (used also to refer generically to paging structure
  gpte guest pte (referring to gfns)
  spte shadow pte (referring to pfns)
  tdp  two dimensional paging (vendor neutral term for NPT and EPT)
+ ==== ====================================================================

  Virtual and real hardware supported
  ===================================
@@ -90,11 +104,13 @@ Events
  The mmu is driven by events, some from the guest, some from the host.

  Guest generated events:
+
  - writes to control registers (especially cr3)
  - invlpg/invlpga instruction execution
  - access to missing or protected translations

  Host generated events:
+
  - changes in the gpa->hpa translation (either through gpa->hva changes or
    through hva->hpa changes)
  - memory pressure (the shrinker)
@@ -117,16 +133,19 @@ Leaf ptes point at guest pages.
  The following table shows translations encoded by leaf ptes, with higher-level
  translations in parentheses:

- Non-nested guests:
+ Non-nested guests::
+
    nonpaging: gpa->hpa
    paging: gva->gpa->hpa
    paging, tdp: (gva->)gpa->hpa
- Nested guests:
+
+ Nested guests::
+
    non-tdp: ngva->gpa->hpa (*)
    tdp: (ngva->)ngpa->gpa->hpa

- (*) the guest hypervisor will encode the ngva->gpa translation into its page
- tables if npt is not present
+ (*) the guest hypervisor will encode the ngva->gpa translation into its page
+ tables if npt is not present

  Shadow pages contain the following information:
  role.level:
@@ -291,28 +310,41 @@ Handling a page fault is performed as follows:
  - if the RSV bit of the error code is set, the page fault is caused by guest
    accessing MMIO and cached MMIO information is available.
+
    - walk shadow page table
    - check for valid generation number in the spte (see "Fast invalidation of
      MMIO sptes" below)
    - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
      vcpu->arch.mmio_gfn, and call the emulator
+
  - If both P bit and R/W bit of error code are set, this could possibly
    be handled as a "fast page fault" (fixed without taking the MMU lock). See
    the description in Documentation/virt/kvm/locking.txt.
+
  - if needed, walk the guest page tables to determine the guest translation
    (gva->gpa or ngpa->gpa)
+
    - if permissions are insufficient, reflect the fault back to the guest
+
  - determine the host page
+
    - if this is an mmio request, there is no host page; cache the info to
      vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn
+
  - walk the shadow page table to find the spte for the translation,
    instantiating missing intermediate page tables as necessary
+
    - If this is an mmio request, cache the mmio info to the spte and set some
      reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
+
  - try to unsynchronize the page
+
    - if successful, we can let the guest continue and modify the gpte
+
  - emulate the instruction
+
    - if failed, unshadow the page and let the guest continue
+
  - update any translations that were modified by the instruction

  invlpg handling:
@@ -324,10 +356,12 @@ invlpg handling:
  Guest control register updates:

  - mov to cr3
+
    - look up new shadow roots
    - synchronize newly reachable shadow pages

  - mov to cr0/cr4/efer
+
    - set up mmu context for new paging mode
    - look up new shadow roots
    - synchronize newly reachable shadow pages
@@ -358,6 +392,7 @@ on fault type:
  (user write faults generate a #PF)

  In the first case there are two additional complications:
+
  - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
    the kernel may now execute it. We handle this by also setting spte.nx.
    If we get a user fetch or read fault, we'll change spte.u=1 and
@@ -446,4 +481,3 @@ Further reading
  - NPT presentation from KVM Forum 2008
    http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
-
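
The first two steps of the page fault handling list in the hunk at -291 branch on bits of the x86 page fault error code. The following is a minimal sketch of that triage only, not the KVM implementation: the macro names, the `pf_path` enum, and `classify_page_fault()` are hypothetical names chosen for illustration. The bit positions (P = bit 0, W/R = bit 1, RSVD = bit 3) are the architectural ones.

```c
#include <stdint.h>

/* Architectural x86 page fault error code bits (names are hypothetical). */
#define PF_ERR_PRESENT (1u << 0) /* P: the translation was present */
#define PF_ERR_WRITE   (1u << 1) /* W/R: the access was a write */
#define PF_ERR_RSVD    (1u << 3) /* RSVD: a reserved bit was set in an entry */

/* Hypothetical labels for the paths described in the list above. */
enum pf_path {
	PF_PATH_MMIO_CACHED,    /* RSV bit set: use cached MMIO information */
	PF_PATH_FAST_CANDIDATE, /* P and R/W set: may be a "fast page fault" */
	PF_PATH_SLOW            /* everything else: walk guest page tables */
};

/* Pick a handling path from the error code alone, in the documented order. */
static enum pf_path classify_page_fault(uint32_t error_code)
{
	const uint32_t fast_bits = PF_ERR_PRESENT | PF_ERR_WRITE;

	if (error_code & PF_ERR_RSVD)
		return PF_PATH_MMIO_CACHED;
	if ((error_code & fast_bits) == fast_bits)
		return PF_PATH_FAST_CANDIDATE;
	return PF_PATH_SLOW;
}
```

A write to a present page (error code 0x3) is classified as a fast page fault candidate here. As the list says, such a fault only "could possibly" be fixed without the MMU lock, so a real handler would still need to fall back to the slow path.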