.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VE exceptions are handled entirely inside the guest
kernel, while others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
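
These mechanisms are built on the new TDCALL instruction. Below is a
minimal, illustrative sketch of the guest-to-hypervisor variant
(TDG.VP.VMCALL, or "TDVMCALL"), assuming the register convention from
Intel's GHCI specification: RAX=0 selects TDG.VP.VMCALL, RCX is a bitmap
of the GPRs exposed to the hypervisor, R10=0 selects the standard
sub-function ABI, R11 names the sub-function, and R12-R15 carry arguments
and results. This is not the kernel's in-tree helper (see
arch/x86/coco/tdx/ for that), and it omits corner cases the real code
must handle::

  #include <linux/types.h>

  struct tdvmcall_args {
      u64 r11;                /* sub-function number in, result out */
      u64 r12, r13, r14, r15; /* arguments in, results out */
  };

  /* Returns the TDVMCALL status from R10; 0 means success. */
  static u64 tdvmcall(struct tdvmcall_args *a)
  {
      u64 rax = 0;                     /* RAX=0: TDG.VP.VMCALL leaf */
      u64 rcx = 0xfc00;                /* expose only R10-R15 */
      register u64 r10 asm("r10") = 0; /* standard sub-function ABI */
      register u64 r11 asm("r11") = a->r11;
      register u64 r12 asm("r12") = a->r12;
      register u64 r13 asm("r13") = a->r13;
      register u64 r14 asm("r14") = a->r14;
      register u64 r15 asm("r15") = a->r15;

      /* TDCALL, spelled in .byte form for old assemblers. */
      asm volatile(".byte 0x66, 0x0f, 0x01, 0xcc"
                   : "+a"(rax), "+c"(rcx), "+r"(r10), "+r"(r11),
                     "+r"(r12), "+r"(r13), "+r"(r14), "+r"(r15)
                   :
                   : "memory");

      a->r11 = r11;
      a->r12 = r12;
      a->r13 = r13;
      a->r14 = r14;
      a->r15 = r15;

      return r10;
  }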

New TDX Exceptions
==================

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
---------------------

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
---------------------

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
--------------------

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor. Guests can make a
hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
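
For illustration, a #VE caused by one of the #VE MSRs might be forwarded
as in the sketch below, reusing the tdvmcall() helper sketched earlier.
The sub-function numbers reuse the VMX exit reasons (31 for RDMSR, 32 for
WRMSR per the GHCI); struct pt_regs is the kernel's exception register
frame, and lower_32_bits()/upper_32_bits() are existing kernel helpers::

  static int ve_handle_rdmsr(struct pt_regs *regs)
  {
      struct tdvmcall_args a = {
          .r11 = 31,          /* Instruction.RDMSR */
          .r12 = regs->cx,    /* MSR index from %ecx */
      };

      if (tdvmcall(&a))
          return -EIO;        /* refused: likely a guest bug */

      /* RDMSR returns the 64-bit value split across %edx:%eax. */
      regs->ax = lower_32_bits(a.r11);
      regs->dx = upper_32_bits(a.r11);
      return 0;
  }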

CPUID Behavior
--------------

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0. For these bit
  fields, the hypervisor can mask off the native values, but it cannot
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
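
A sketch of that hypercall, again reusing the earlier tdvmcall() helper
and assuming the GHCI layout for Instruction.CPUID (sub-function 10, the
VMX exit reason for CPUID): the leaf goes in R12, the sub-leaf in R13,
and EAX/EBX/ECX/EDX come back in R12-R15::

  static int ve_handle_cpuid(struct pt_regs *regs)
  {
      struct tdvmcall_args a = {
          .r11 = 10,          /* Instruction.CPUID */
          .r12 = regs->ax,    /* leaf */
          .r13 = regs->cx,    /* sub-leaf */
      };

      if (tdvmcall(&a))
          return -EIO;

      regs->ax = a.r12;       /* EAX */
      regs->bx = a.r13;       /* EBX */
      regs->cx = a.r14;       /* ECX */
      regs->dx = a.r15;       /* EDX */
      return 0;
  }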

#VE on Memory Accesses
======================

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
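
For illustration, the selection might look like the hypothetical helper
below. The "shared" bit is the topmost guest physical address bit; its
position depends on the guest physical address width (GPAW) that the TDX
module reports, e.g. bit 47 for a 48-bit GPAW or bit 51 for a 52-bit
GPAW. tdx_shared_bit is an invented variable standing in for that
boot-time discovery::

  /* Assumed to be filled in from TDG.VP.INFO during early boot:
   * the GPA width minus one, i.e. 47 or 51. */
  static unsigned int tdx_shared_bit;

  static pte_t pte_mk_shared(pte_t pte)
  {
      /* Setting the bit makes accesses through this PTE shared;
       * clearing it makes them private again. */
      return __pte(pte_val(pte) | BIT_ULL(tdx_shared_bit));
  }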

#VE on Shared Memory
--------------------

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages for which it can safely handle a
#VE. For instance, the guest should be careful not to access shared memory
in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks. A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace. Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
--------------------

An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.
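
Acceptance is performed with a TDCALL into the TDX module. A sketch for a
single 4K page, assuming the TDG.MEM.PAGE.ACCEPT leaf number (6) and its
RCX encoding (guest physical address with the mapping level in the low
bits) from the TDX module specification::

  static int tdx_accept_page(phys_addr_t gpa)
  {
      u64 rax = 6;    /* TDG.MEM.PAGE.ACCEPT */
      u64 rcx = gpa;  /* mapping level 0 (4K) in the low bits */

      /* TDCALL into the TDX module; status comes back in RAX. */
      asm volatile(".byte 0x66, 0x0f, 0x01, 0xcc"
                   : "+a"(rax), "+c"(rcx)
                   :
                   : "memory");

      return rax ? -EIO : 0;
  }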

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE. It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
=================

Just like page faults or #GPs, #VE exceptions can either be handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked. The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
or a new #VE can be delivered.
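
A sketch of retrieving that information, assuming the TDG.VP.VEINFO.GET
leaf number (3) and its output registers from the TDX module
specification; the ve_info structure here is invented for illustration::

  struct ve_info {
      u64 exit_reason;
      u64 exit_qual;
      u64 gla;            /* guest linear address */
      u64 gpa;            /* guest physical address */
      u32 instr_len;
      u32 instr_info;
  };

  static int tdx_get_ve_info(struct ve_info *ve)
  {
      u64 rax = 3;        /* TDG.VP.VEINFO.GET */
      u64 rcx, rdx;
      register u64 r8 asm("r8");
      register u64 r9 asm("r9");
      register u64 r10 asm("r10");

      /* This TDCALL also re-enables #VE/NMI delivery. */
      asm volatile(".byte 0x66, 0x0f, 0x01, 0xcc"
                   : "+a"(rax), "=c"(rcx), "=d"(rdx),
                     "=r"(r8), "=r"(r9), "=r"(r10)
                   :
                   : "memory");

      if (rax)
          return -EIO;

      ve->exit_reason = rcx;
      ve->exit_qual   = rdx;
      ve->gla         = r8;
      ve->gpa         = r9;
      ve->instr_len   = lower_32_bits(r10);
      ve->instr_info  = upper_32_bits(r10);
      return 0;
  }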

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.

MMIO handling
=============

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because a VMEXIT
would expose the register state to the host. TDX guests don't trust the
host and can't have their state exposed to it.

In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.
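
For a decoded MMIO read, the resulting hypercall might look like the
sketch below, which reuses the earlier tdvmcall() helper and assumes the
GHCI MMIO sub-function (48, the VMX EPT-violation exit reason): R12
carries the access size, R13 the direction (0 for read, 1 for write),
R14 the MMIO address, and a read's value comes back in R11::

  static int mmio_read(int size, unsigned long addr, u64 *val)
  {
      struct tdvmcall_args a = {
          .r11 = 48,      /* #VE.RequestMMIO */
          .r12 = size,    /* 1, 2, 4 or 8 bytes */
          .r13 = 0,       /* 0: read */
          .r14 = addr,    /* MMIO guest physical address */
      };

      if (tdvmcall(&a))
          return -EIO;

      *val = a.r11;
      return 0;
  }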

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.
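
To illustrate the difference (the device layout below is hypothetical):
the io.h accessors compile to simple, decodable MOV forms, while a
structure overlay leaves instruction selection to the compiler::

  #include <linux/io.h>

  struct dev_regs {       /* hypothetical device */
      u32 status;
      u32 ctrl;
  };

  static void mmio_example(void __iomem *base)
  {
      /* Fine: readl()/writel() generate decodable accesses. */
      u32 status = readl(base);

      writel(status | 0x1, base + 4);

      /* Risky: the compiler may emit an instruction the #VE
       * decoder refuses, leading to an oops. Don't do this. */
      struct dev_regs __iomem *regs = base;

      status = *(volatile u32 *)&regs->status;
  }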

Shared Memory Conversions
=========================

All TDX guest memory starts out as private at boot. This memory cannot
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers (a usage
sketch follows the list):

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

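A sketch of a driver sharing one page with the hypervisor; the two
helpers above are the existing kernel API, while the surrounding logic is
invented for illustration::

  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <linux/set_memory.h>

  static void *share_page_with_host(void)
  {
      struct page *page = alloc_page(GFP_KERNEL);

      if (!page)
          return NULL;

      /* Flip the page's PTE to shared and notify the TDX machinery;
       * afterwards the hypervisor can access its contents. */
      if (set_memory_decrypted((unsigned long)page_address(page), 1)) {
          __free_page(page);
          return NULL;
      }

      return page_address(page);
  }

Such a page must be converted back with set_memory_encrypted() before it
is returned to the page allocator.
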
Device drivers are the primary users of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted at allocation
time. Check force_dma_unencrypted() for details.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html