| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +Confidential Computing VMs |
| 4 | +========================== |
| 5 | +Hyper-V can create and run Linux guests that are Confidential Computing |
| 6 | +(CoCo) VMs. Such VMs cooperate with the physical processor to better protect |
| 7 | +the confidentiality and integrity of data in the VM's memory, even in the |
| 8 | +face of a hypervisor/VMM that has been compromised and may behave maliciously. |
| 9 | +CoCo VMs on Hyper-V share the generic CoCo VM threat model and security |
| 10 | +objectives described in Documentation/security/snp-tdx-threat-model.rst. Note |
| 11 | +that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or |
| 12 | +"isolation VMs". |
| 13 | + |
| 14 | +A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the |
| 15 | +following: |
| 16 | + |
| 17 | +* Physical hardware with a processor that supports CoCo VMs |
| 18 | + |
| 19 | +* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs |
| 20 | + |
| 21 | +* The VM runs a version of Linux that supports being a CoCo VM |
| 22 | + |
| 23 | +The physical hardware requirements are as follows: |
| 24 | + |
| 25 | +* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME, |
| 26 | + SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo |
| 27 | + VM on Hyper-V. |
| 28 | + |
| 29 | +* Intel processor with TDX |
| 30 | + |
| 31 | +To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V |
| 32 | +when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM, |
| 33 | +or vice versa, after it is created. |
| 34 | + |
| 35 | +Operational Modes |
| 36 | +----------------- |
| 37 | +Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is |
| 38 | +created and cannot be changed during the life of the VM. |
| 39 | + |
| 40 | +* Fully-enlightened mode. In this mode, the guest operating system is |
| 41 | + enlightened to understand and manage all aspects of running as a CoCo VM. |
| 42 | + |
| 43 | +* Paravisor mode. In this mode, a paravisor layer between the guest and the |
| 44 | + host provides some operations needed to run as a CoCo VM. The guest operating |
| 45 | + system can have fewer CoCo enlightenments than is required in the |
| 46 | + fully-enlightened case. |
| 47 | + |
| 48 | +Conceptually, fully-enlightened mode and paravisor mode may be treated as |
| 49 | +points on a spectrum spanning the degree of guest enlightenment needed to run |
| 50 | +as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full |
| 51 | +implementation of paravisor mode is the other end of the spectrum, where all |
| 52 | +aspects of running as a CoCo VM are handled by the paravisor, and a normal |
| 53 | +guest OS with no knowledge of memory encryption or other aspects of CoCo VMs |
| 54 | +can run successfully. However, the Hyper-V implementation of paravisor mode |
| 55 | +does not go this far, and is somewhere in the middle of the spectrum. Some |
| 56 | +aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS |
| 57 | +must be enlightened for other aspects. Unfortunately, there is no |
| 58 | +standardized enumeration of features/functions that might be provided in the |
| 59 | +paravisor, and there is no standardized mechanism for a guest OS to query the |
| 60 | +paravisor for the features/functions it provides. The understanding of what |
| 61 | +the paravisor provides is hard-coded in the guest OS. |
| 62 | + |
| 63 | +Paravisor mode has similarities to the `Coconut project`_, which aims to provide |
| 64 | +a limited paravisor that offers services to the guest, such as a virtual TPM. |
| 65 | +However, the Hyper-V paravisor generally handles more aspects of CoCo VMs |
| 66 | +than is currently envisioned for Coconut, and so is further toward the "no |
| 67 | +guest enlightenments required" end of the spectrum. |
| 68 | + |
| 69 | +.. _Coconut project: https://github.com/coconut-svsm/svsm |
| 70 | + |
| 71 | +In the CoCo VM threat model, the paravisor is in the guest security domain |
| 72 | +and must be trusted by the guest OS. By implication, the hypervisor/VMM must |
| 73 | +protect itself against a potentially malicious paravisor just like it |
| 74 | +protects against a potentially malicious guest. |
| 75 | + |
| 76 | +The hardware architectural approach to fully-enlightened vs. paravisor mode |
| 77 | +varies depending on the underlying processor. |
| 78 | + |
| 79 | +* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in |
| 80 | + VMPL 0 and has full control of the guest context. In paravisor mode, the |
| 81 | + guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor |
| 82 | + running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have. |
| 83 | + Certain operations require the guest to invoke the paravisor. Furthermore, in |
| 84 | + paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode |
| 85 | + as defined by the SEV-SNP architecture. This mode simplifies guest management |
| 86 | + of memory encryption when a paravisor is used. |
| 87 | + |
| 88 | +* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an |
| 89 | + L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the |
| 90 | + L1 VM, and the guest OS runs in a nested L2 VM. |
| 91 | + |
| 92 | +Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This |
| 93 | +MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and |
| 94 | +whether a paravisor is being used. It is straightforward to build a single |
| 95 | +kernel image that can boot and run properly on either architecture, and in |
| 96 | +either mode. |
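
As a rough illustration of how a single kernel image might branch on the reported CoCo mode, the sketch below decodes a raw 64-bit value into an architecture and a paravisor flag. The bit positions, the enum values, and every ``ex_`` name are hypothetical and chosen only for this example; the real encoding is defined by Hyper-V and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL layout, for illustration only (not the Hyper-V encoding). */
#define EX_ISOLATION_TYPE_MASK 0xFull        /* assumed: bits 0-3 hold the type */
#define EX_PARAVISOR_PRESENT   (1ull << 4)   /* assumed: bit 4 flags a paravisor */

enum ex_coco_arch {
	EX_COCO_NONE = 0,   /* not a CoCo VM */
	EX_COCO_SNP  = 1,   /* assumed value for AMD SEV-SNP */
	EX_COCO_TDX  = 2,   /* assumed value for Intel TDX */
};

struct ex_coco_mode {
	enum ex_coco_arch arch;
	bool paravisor;
};

/* Decode the raw value once; later code consults only the decoded form. */
static struct ex_coco_mode ex_decode_coco_mode(uint64_t raw)
{
	struct ex_coco_mode mode = { EX_COCO_NONE, false };

	switch (raw & EX_ISOLATION_TYPE_MASK) {
	case EX_COCO_SNP:
		mode.arch = EX_COCO_SNP;
		break;
	case EX_COCO_TDX:
		mode.arch = EX_COCO_TDX;
		break;
	default:
		/* Unknown types are treated as "not a CoCo VM" in this sketch. */
		break;
	}

	/* A paravisor flag is only meaningful for a CoCo VM. */
	if (mode.arch != EX_COCO_NONE)
		mode.paravisor = (raw & EX_PARAVISOR_PRESENT) != 0;

	return mode;
}

/* True when the guest itself must carry every CoCo enlightenment. */
static bool ex_fully_enlightened(const struct ex_coco_mode *mode)
{
	assert(mode != NULL);
	return mode->arch != EX_COCO_NONE && !mode->paravisor;
}
```

Decoding once and passing the small struct around keeps architecture and mode checks out of the individual call sites, which is one way the "single kernel image" property could be kept manageable.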
| 97 | + |
| 98 | +Paravisor Effects |
| 99 | +----------------- |
| 100 | +Running in paravisor mode affects the following areas of generic Linux kernel |
| 101 | +CoCo VM functionality: |
| 102 | + |
| 103 | +* Initial guest memory setup. When a new VM is created in paravisor mode, the |
| 104 | + paravisor runs first and sets up the guest physical memory as encrypted. The |
| 105 | + guest Linux does normal memory initialization, except for explicitly marking |
| 106 | + appropriate ranges as decrypted (shared). In paravisor mode, Linux does not |
| 107 | + perform the early boot memory setup steps that are particularly tricky with |
| 108 | + AMD SEV-SNP in fully-enlightened mode. |
| 109 | + |
| 110 | +* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest |
| 111 | + CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM, |
| 112 | + respectively, and not the guest Linux. Consequently, these exception handlers |
| 113 | + do not run in the guest Linux and are not a required enlightenment for a |
| 114 | + Linux guest in paravisor mode. |
| 115 | + |
| 116 | +* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the |
| 117 | + guest indicating that the VM is operating with the respective hardware |
| 118 | + support. While these CPUID flags are visible in fully-enlightened CoCo VMs, |
| 119 | + the paravisor filters out these flags and the guest Linux does not see them. |
| 120 | + Throughout the Linux kernel, explicitly testing these flags has mostly been |
| 121 | + eliminated in favor of the cc_platform_has() function, with the goal of |
| 122 | + abstracting the differences between SEV-SNP and TDX. But the |
| 123 | + cc_platform_has() abstraction also allows the Hyper-V paravisor configuration |
| 124 | + to selectively enable aspects of CoCo VM functionality even when the CPUID |
| 125 | + flags are not set. The exception is early boot memory setup on SEV-SNP, which |
| 126 | + tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor |
| 127 | + mode VM achieves the desired effect of not running SEV-SNP specific early |
| 128 | + boot memory setup. |
| 129 | + |
| 130 | +* Device emulation. In paravisor mode, the Hyper-V paravisor provides |
| 131 | + emulation of devices such as the IO-APIC and TPM. Because the emulation |
| 132 | + happens in the paravisor in the guest context (instead of the hypervisor/VMM |
| 133 | + context), MMIO accesses to these devices must be encrypted references instead |
| 134 | + of the decrypted references that would be used in a fully-enlightened CoCo |
| 135 | + VM. The __ioremap_caller() function has been enhanced to make a callback to |
| 136 | + check whether a particular address range should be treated as encrypted |
| 137 | + (private). See the "is_private_mmio" callback. |
| 138 | + |
| 139 | +* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest |
| 140 | + memory between encrypted and decrypted requires coordinating with the |
| 141 | + hypervisor/VMM. This is done via callbacks invoked from |
| 142 | + __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and |
| 143 | + TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V |
| 144 | + specific set of callbacks is used. These callbacks invoke the paravisor so |
| 145 | + that the paravisor can coordinate the transitions and inform the hypervisor |
| 146 | + as necessary. See hv_vtom_init() where these callbacks are set up. |
| 147 | + |
| 148 | +* Interrupt injection. In fully-enlightened mode, a malicious hypervisor |
| 149 | + could inject interrupts into the guest OS at times that violate x86/x64 |
| 150 | + architectural rules. For full protection, the guest OS should include |
| 151 | + enlightenments that use the interrupt injection management features provided |
| 152 | + by CoCo-capable processors. In paravisor mode, the paravisor mediates |
| 153 | + interrupt injection into the guest OS, and ensures that the guest OS only |
| 154 | + sees interrupts that are "legal". The paravisor uses the interrupt injection |
| 155 | + management features provided by the CoCo-capable physical processor, thereby |
| 156 | + masking these complexities from the guest OS. |
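
The device emulation item above describes a callback that decides whether an MMIO range must be mapped encrypted. A minimal user-space sketch of that decision follows. It is not the kernel's implementation: all ``ex_`` names are hypothetical, and the two base addresses are the conventional x86 IO-APIC and TPM MMIO locations, assumed here purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE   4096ull
#define EX_IOAPIC_BASE 0xFEC00000ull   /* assumed: conventional IO-APIC base */
#define EX_VTPM_BASE   0xFED40000ull   /* assumed: conventional TPM MMIO base */

/* Shape of the callback: "is this MMIO address emulated in the guest context?" */
typedef bool (*ex_is_private_mmio_fn)(uint64_t addr);

enum ex_map_kind { EX_MAP_DECRYPTED, EX_MAP_ENCRYPTED };

static bool ex_in_page(uint64_t addr, uint64_t base)
{
	return addr >= base && addr < base + EX_PAGE_SIZE;
}

/* Paravisor case: devices emulated by the paravisor are private. */
static bool ex_paravisor_is_private_mmio(uint64_t addr)
{
	return ex_in_page(addr, EX_IOAPIC_BASE) || ex_in_page(addr, EX_VTPM_BASE);
}

/* No paravisor: the host emulates every device, so nothing is private. */
static bool ex_default_is_private_mmio(uint64_t addr)
{
	(void)addr;
	return false;
}

/* The mapping code asks the callback instead of hard-coding a policy. */
static enum ex_map_kind ex_choose_mapping(ex_is_private_mmio_fn cb, uint64_t addr)
{
	assert(cb != NULL);
	return cb(addr) ? EX_MAP_ENCRYPTED : EX_MAP_DECRYPTED;
}
```

Routing the decision through a function pointer lets the generic mapping path stay unaware of Hyper-V, while the platform code installs whichever policy matches the mode it detected at boot.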
| 157 | + |
| 158 | +Hyper-V Hypercalls |
| 159 | +------------------ |
| 160 | +When in fully-enlightened mode, hypercalls made by the Linux guest are routed |
| 161 | +directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode, |
| 162 | +normal hypercalls trap to the paravisor first, which may in turn invoke the |
| 163 | +hypervisor. But the paravisor is idiosyncratic in this regard, and a few |
| 164 | +hypercalls made by the Linux guest must always be routed directly to the |
| 165 | +hypervisor. These hypercall sites test for a paravisor being present, and use |
| 166 | +a special invocation sequence. See hv_post_message(), for example. |
| 167 | + |
| 168 | +Guest communication with Hyper-V |
| 169 | +-------------------------------- |
| 170 | +Separate from the generic Linux kernel handling of memory encryption in Linux |
| 171 | +CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory |
| 172 | +shared between the Linux guest and the host. This shared memory must be |
| 173 | +marked decrypted to enable communication. Furthermore, since the threat model |
| 174 | +includes a compromised and potentially malicious host, the guest must guard |
| 175 | +against leaking any unintended data to the host through this shared memory. |
| 176 | + |
| 177 | +These Hyper-V and VMBus memory pages are marked as decrypted: |
| 178 | + |
| 179 | +* VMBus monitor pages |
| 180 | + |
| 181 | +* Synthetic interrupt controller (synic) related pages (unless supplied by |
| 182 | + the paravisor) |
| 183 | + |
| 184 | +* Per-cpu hypercall input and output pages (unless running with a paravisor) |
| 185 | + |
| 186 | +* VMBus ring buffers. The direct mapping is marked decrypted in |
| 187 | + __vmbus_establish_gpadl(). The secondary mapping created in |
| 188 | + hv_ringbuffer_init() must also include the "decrypted" attribute. |
| 189 | + |
| 190 | +When the guest writes data to memory that is shared with the host, it must |
| 191 | +ensure that only the intended data is written. Padding or unused fields must |
| 192 | +be initialized to zeros before copying into the shared memory so that random |
| 193 | +kernel data is not inadvertently given to the host. |
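
The zero-before-copy rule can be sketched as follows. The message layout and the ``ex_`` names are hypothetical; the point is only the order of operations: clear the whole staging structure, fill the intended fields, then copy the result into shared memory in one step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; compilers typically pad after 'type'. */
struct ex_msg {
	uint8_t  type;
	uint32_t length;        /* number of valid bytes in payload */
	uint8_t  payload[16];
};

/*
 * Build the message in private memory, then publish it. 'shared' stands
 * in for a buffer the host can read; it must hold sizeof(struct ex_msg).
 */
static void ex_send_to_shared(void *shared, uint8_t type,
			      const void *data, uint32_t len)
{
	struct ex_msg msg;

	assert(shared != NULL);
	assert(len <= sizeof(msg.payload));
	assert(len == 0 || data != NULL);

	/* Clear everything first so padding and the unused payload tail
	 * do not carry leftover stack contents to the host. */
	memset(&msg, 0, sizeof(msg));

	msg.type = type;
	msg.length = len;
	if (len)
		memcpy(msg.payload, data, len);

	/* Only now does anything reach host-visible memory. */
	memcpy(shared, &msg, sizeof(msg));
}
```

Staging the message privately also means the host never observes a half-built message, which would not be the case if the fields were written directly into the shared buffer.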
| 194 | + |
| 195 | +Similarly, when the guest reads memory that is shared with the host, it must |
| 196 | +validate the data before acting on it so that a malicious host cannot induce |
| 197 | +the guest to expose unintended data. Doing such validation can be tricky |
| 198 | +because the host can modify the shared memory areas even while or after |
| 199 | +validation is performed. For messages passed from the host to the guest in a |
| 200 | +VMBus ring buffer, the length of the message is validated, and the message is |
| 201 | +copied into a temporary (encrypted) buffer for further validation and |
| 202 | +processing. The copying adds a small amount of overhead, but is the only way |
| 203 | +to protect against a malicious host. See hv_pkt_iter_first(). |
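
The validate-then-copy pattern can be sketched as below. This is an illustration under assumed names and an assumed header layout, not the code in hv_pkt_iter_first(). The length is read from shared memory exactly once, checked, used for the copy, and then written back into the private copy so that later code cannot observe a different value that the host stored in the meantime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_MSG 64u   /* assumed upper bound for this sketch */

/* Hypothetical packet header: total length in bytes, header included. */
struct ex_pkt_hdr {
	uint32_t len;
};

/* Returns the number of bytes copied, or -1 if the length is unacceptable. */
static int ex_copy_from_shared(const uint8_t *shared, size_t shared_size,
			       uint8_t *priv, size_t priv_size)
{
	struct ex_pkt_hdr hdr;
	uint32_t len;

	assert(shared != NULL && priv != NULL);

	if (shared_size < sizeof(hdr))
		return -1;

	/* Snapshot the header once; never re-read it from shared memory. */
	memcpy(&hdr, shared, sizeof(hdr));
	len = hdr.len;

	if (len < sizeof(hdr) || len > EX_MAX_MSG ||
	    len > shared_size || len > priv_size)
		return -1;

	/* Copy into private (encrypted) memory before any further parsing. */
	memcpy(priv, shared, len);

	/* The host may have rewritten the header during the copy. Force the
	 * private copy to carry the length that was actually validated. */
	hdr.len = len;
	memcpy(priv, &hdr, sizeof(hdr));

	return (int)len;
}
```

All further validation then operates on ``priv`` only, so a concurrent write by the host can no longer change what the guest acts on.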
| 204 | + |
| 205 | +Many drivers for VMBus devices have been "hardened" by adding code to fully |
| 206 | +validate messages received over VMBus, instead of assuming that Hyper-V is |
| 207 | +acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the |
| 208 | +vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a |
| 209 | +CoCo VM have not been hardened, and they are not allowed to load in a CoCo |
| 210 | +VM. See vmbus_is_valid_offer() where such devices are excluded. |
| 211 | + |
| 212 | +Two VMBus devices depend on the Hyper-V host to do DMA data transfers: |
| 213 | +storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal |
| 214 | +Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb |
| 215 | +memory is done implicitly. netvsc has two modes for data transfers. The first |
| 216 | +mode goes through send and receive buffer space that is explicitly allocated |
| 217 | +by the netvsc driver, and is used for most smaller packets. These send and |
| 218 | +receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because |
| 219 | +the netvsc driver explicitly copies packets to/from these buffers, the |
| 220 | +equivalent of bounce buffering between encrypted and decrypted memory is |
| 221 | +already part of the data path. The second mode uses the normal Linux kernel |
| 222 | +DMA APIs, and is bounce buffered through swiotlb memory implicitly like in |
| 223 | +storvsc. |
| 224 | + |
| 225 | +Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM. |
| 226 | +Linux PCI device drivers access PCI config space using standard APIs provided |
| 227 | +by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO |
| 228 | +space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory |
| 229 | +encryption prevents Hyper-V from reading the guest instruction stream to |
| 230 | +emulate the access. So in a CoCo VM, these functions must make a hypercall |
| 231 | +with arguments explicitly describing the access. See |
| 232 | +_hv_pcifront_read_config() and _hv_pcifront_write_config() and the |
| 233 | +"use_calls" flag indicating to use hypercalls. |
| 234 | + |
| 235 | +load_unaligned_zeropad() |
| 236 | +------------------------ |
| 237 | +When transitioning memory between encrypted and decrypted, the caller of |
| 238 | +set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring |
| 239 | +the memory isn't in use and isn't referenced while the transition is in |
| 240 | +progress. The transition has multiple steps, and includes interaction with |
| 241 | +the Hyper-V host. The memory is in an inconsistent state until all steps are |
| 242 | +complete. A reference while the state is inconsistent could result in an |
| 243 | +exception that can't be cleanly fixed up. |
| 244 | + |
| 245 | +However, the kernel load_unaligned_zeropad() mechanism may make stray |
| 246 | +references that can't be prevented by the caller of set_memory_encrypted() or |
| 247 | +set_memory_decrypted(), so there's specific code in the #VC or #VE exception |
| 248 | +handler to fix up this case. But a CoCo VM running on Hyper-V may be |
| 249 | +configured to run with a paravisor, with the #VC or #VE exception routed to |
| 250 | +the paravisor. There's no architectural way to forward the exceptions back to |
| 251 | +the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code |
| 252 | +in the #VC/#VE handlers doesn't run. |
| 253 | + |
| 254 | +To avoid this problem, the Hyper-V specific functions for notifying the |
| 255 | +hypervisor of the transition mark pages as "not present" while a transition |
| 256 | +is in progress. If load_unaligned_zeropad() causes a stray reference, a |
| 257 | +normal page fault is generated instead of #VC or #VE, and the page-fault- |
| 258 | +based handlers for load_unaligned_zeropad() fix up the reference. When the |
| 259 | +encrypted/decrypted transition is complete, the pages are marked as "present" |
| 260 | +again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility(). |
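
The ordering described above can be modeled in a few lines of user-space C. This is a toy model, not kernel code: the page structure, the hook, and all ``ex_`` names are hypothetical. It only shows that a stray load arriving in the middle of a transition sees a not-present page and therefore takes the ordinary page-fault path, modeled here as returning zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ex_visibility { EX_PRIVATE, EX_SHARED };

/* Toy stand-in for one page and its page table entry state. */
struct ex_page {
	bool present;
	enum ex_visibility vis;
	uint8_t data[8];
};

/*
 * Model of a stray load. A present page returns its data. A not-present
 * page takes the page-fault path, whose fixup supplies zero.
 */
static uint8_t ex_stray_load(const struct ex_page *pg, size_t off, bool *faulted)
{
	assert(pg != NULL && faulted != NULL);
	assert(off < sizeof(pg->data));

	*faulted = !pg->present;
	return pg->present ? pg->data[off] : 0;
}

/* Records whether a load issued mid-transition faulted. */
static bool ex_probe_faulted;

static void ex_probe(struct ex_page *pg)
{
	(void)ex_stray_load(pg, 0, &ex_probe_faulted);
}

/* mid_hook runs while the page is in its inconsistent state. */
static void ex_change_visibility(struct ex_page *pg, enum ex_visibility new_vis,
				 void (*mid_hook)(struct ex_page *))
{
	assert(pg != NULL);

	pg->present = false;    /* 1. hide the page from stray references */
	if (mid_hook)
		mid_hook(pg);   /*    anything touching it now page-faults */
	pg->vis = new_vis;      /* 2. stands in for the host notification */
	pg->present = true;     /* 3. transition complete, expose again */
}
```

Calling ``ex_change_visibility(&pg, EX_SHARED, ex_probe)`` leaves ``ex_probe_faulted`` set, which mirrors the intended behavior: the stray reference is absorbed by the page-fault fixup rather than reaching memory whose state is inconsistent.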