Commit fb92a1f

Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:

 - Add a documentation overview of Confidential Computing VM support
   (Michael Kelley)

 - Use lapic timer in a TDX VM without paravisor (Dexuan Cui)

 - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
   (Michael Kelley)

 - Fix a kexec crash due to VP assist page corruption (Anirudh
   Rayabharam)

 - Python3 compatibility fix for lsvmbus (Anthony Nandaa)

 - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)

* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv: vmbus: Constify struct kobj_type and struct attribute_group
  tools: hv: rm .*.cmd when make clean
  x86/hyperv: fix kexec crash due to VP assist page corruption
  Drivers: hv: vmbus: Fix the misplaced function description
  tools: hv: lsvmbus: change shebang to use python3
  x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
  Documentation: hyperv: Add overview of Confidential Computing VM support
  clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
  Drivers: hv: Remove deprecated hv_fcopy declarations
2 parents da3ea35 + 8953848 commit fb92a1f

11 files changed: +302 -22 lines changed

Documentation/virt/hyperv/coco.rst

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Confidential Computing VMs
+==========================
+Hyper-V can create and run Linux guests that are Confidential Computing
+(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
+the confidentiality and integrity of data in the VM's memory, even in the
+face of a hypervisor/VMM that has been compromised and may behave maliciously.
+CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
+objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
+that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
+"isolation VMs".
+
+A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
+following:
+
+* Physical hardware with a processor that supports CoCo VMs
+
+* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
+
+* The VM runs a version of Linux that supports being a CoCo VM
+
+The physical hardware requirements are as follows:
+
+* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
+  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
+  VM on Hyper-V.
+
+* Intel processor with TDX
+
+To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
+when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
+or vice versa, after it is created.
+
+Operational Modes
+-----------------
+Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
+created and cannot be changed during the life of the VM.
+
+* Fully-enlightened mode. In this mode, the guest operating system is
+  enlightened to understand and manage all aspects of running as a CoCo VM.
+
+* Paravisor mode. In this mode, a paravisor layer between the guest and the
+  host provides some operations needed to run as a CoCo VM. The guest operating
+  system can have fewer CoCo enlightenments than is required in the
+  fully-enlightened case.
+
+Conceptually, fully-enlightened mode and paravisor mode may be treated as
+points on a spectrum spanning the degree of guest enlightenment needed to run
+as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
+implementation of paravisor mode is the other end of the spectrum, where all
+aspects of running as a CoCo VM are handled by the paravisor, and a normal
+guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
+can run successfully. However, the Hyper-V implementation of paravisor mode
+does not go this far, and is somewhere in the middle of the spectrum. Some
+aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
+must be enlightened for other aspects. Unfortunately, there is no
+standardized enumeration of feature/functions that might be provided in the
+paravisor, and there is no standardized mechanism for a guest OS to query the
+paravisor for the feature/functions it provides. The understanding of what
+the paravisor provides is hard-coded in the guest OS.
+
+Paravisor mode has similarities to the `Coconut project`_, which aims to provide
+a limited paravisor to provide services to the guest such as a virtual TPM.
+However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
+than is currently envisioned for Coconut, and so is further toward the "no
+guest enlightenments required" end of the spectrum.
+
+.. _Coconut project: https://github.com/coconut-svsm/svsm
+
+In the CoCo VM threat model, the paravisor is in the guest security domain
+and must be trusted by the guest OS. By implication, the hypervisor/VMM must
+protect itself against a potentially malicious paravisor just like it
+protects against a potentially malicious guest.
+
+The hardware architectural approach to fully-enlightened vs. paravisor mode
+varies depending on the underlying processor.
+
+* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
+  VMPL 0 and has full control of the guest context. In paravisor mode, the
+  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
+  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
+  Certain operations require the guest to invoke the paravisor. Furthermore, in
+  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
+  as defined by the SEV-SNP architecture. This mode simplifies guest management
+  of memory encryption when a paravisor is used.
+
+* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
+  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
+  L1 VM, and the guest OS runs in a nested L2 VM.
+
+Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
+MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
+whether a paravisor is being used. It is straightforward to build a single
+kernel image that can boot and run properly on either architecture, and in
+either mode.
+
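The two inputs named in the Operational Modes text (processor architecture and paravisor presence) determine where the guest OS runs. The user-space sketch below illustrates that mapping only. The bit layout in the `DEMO_*` macros is hypothetical and is not the layout of the real synthetic MSR, and every `demo_*` name is invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding for illustration; NOT the real synthetic MSR layout. */
#define DEMO_ARCH_MASK      0x3u        /* bits 1:0: processor architecture */
#define DEMO_ARCH_SEV_SNP   0x1u
#define DEMO_ARCH_TDX       0x2u
#define DEMO_PARAVISOR_BIT  (1u << 2)   /* bit 2: paravisor in use */

enum demo_arch { DEMO_NOT_COCO, DEMO_SEV_SNP, DEMO_TDX };

/* Where the guest OS runs, following the SEV-SNP and TDX descriptions. */
enum demo_guest_level {
	DEMO_LEVEL_NONE,    /* not a CoCo VM */
	DEMO_LEVEL_VMPL0,   /* SEV-SNP, fully-enlightened */
	DEMO_LEVEL_VMPL2,   /* SEV-SNP, paravisor runs in VMPL 0 */
	DEMO_LEVEL_TDX_L1,  /* TDX, fully-enlightened */
	DEMO_LEVEL_TDX_L2,  /* TDX, paravisor runs in the L1 VM */
};

struct demo_coco_mode {
	enum demo_arch arch;
	bool paravisor;
};

/* Decode the hypothetical value into an architecture and a paravisor flag. */
static struct demo_coco_mode demo_decode(uint64_t value)
{
	struct demo_coco_mode mode = { DEMO_NOT_COCO, false };

	switch (value & DEMO_ARCH_MASK) {
	case DEMO_ARCH_SEV_SNP:
		mode.arch = DEMO_SEV_SNP;
		break;
	case DEMO_ARCH_TDX:
		mode.arch = DEMO_TDX;
		break;
	default:
		return mode;    /* the paravisor flag is ignored without CoCo */
	}
	mode.paravisor = (value & DEMO_PARAVISOR_BIT) != 0;
	return mode;
}

/* Map the decoded mode to the privilege level the guest OS occupies. */
static enum demo_guest_level demo_level_of(struct demo_coco_mode mode)
{
	switch (mode.arch) {
	case DEMO_SEV_SNP:
		return mode.paravisor ? DEMO_LEVEL_VMPL2 : DEMO_LEVEL_VMPL0;
	case DEMO_TDX:
		return mode.paravisor ? DEMO_LEVEL_TDX_L2 : DEMO_LEVEL_TDX_L1;
	default:
		assert(!mode.paravisor);
		return DEMO_LEVEL_NONE;
	}
}
```

A single kernel image can branch on a decoded value like this once at boot, which is one way to read the claim that one image can serve both architectures and both modes.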
+Paravisor Effects
+-----------------
+Running in paravisor mode affects the following areas of generic Linux kernel
+CoCo VM functionality:
+
+* Initial guest memory setup. When a new VM is created in paravisor mode, the
+  paravisor runs first and sets up the guest physical memory as encrypted. The
+  guest Linux does normal memory initialization, except for explicitly marking
+  appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
+  perform the early boot memory setup steps that are particularly tricky with
+  AMD SEV-SNP in fully-enlightened mode.
+
+* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
+  CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
+  respectively, and not the guest Linux. Consequently, these exception handlers
+  do not run in the guest Linux and are not a required enlightenment for a
+  Linux guest in paravisor mode.
+
+* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
+  guest indicating that the VM is operating with the respective hardware
+  support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
+  the paravisor filters out these flags and the guest Linux does not see them.
+  Throughout the Linux kernel, explicitly testing these flags has mostly been
+  eliminated in favor of the cc_platform_has() function, with the goal of
+  abstracting the differences between SEV-SNP and TDX. But the
+  cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
+  to selectively enable aspects of CoCo VM functionality even when the CPUID
+  flags are not set. The exception is early boot memory setup on SEV-SNP, which
+  tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor
+  mode VM achieves the desired effect of not running SEV-SNP specific early
+  boot memory setup.
+
+* Device emulation. In paravisor mode, the Hyper-V paravisor provides
+  emulation of devices such as the IO-APIC and TPM. Because the emulation
+  happens in the paravisor in the guest context (instead of the hypervisor/VMM
+  context), MMIO accesses to these devices must be encrypted references instead
+  of the decrypted references that would be used in a fully-enlightened CoCo
+  VM. The __ioremap_caller() function has been enhanced to make a callback to
+  check whether a particular address range should be treated as encrypted
+  (private). See the "is_private_mmio" callback.
+
+* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
+  memory between encrypted and decrypted requires coordinating with the
+  hypervisor/VMM. This is done via callbacks invoked from
+  __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
+  TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
+  specific set of callbacks is used. These callbacks invoke the paravisor so
+  that the paravisor can coordinate the transitions and inform the hypervisor
+  as necessary. See hv_vtom_init() where these callbacks are set up.
+
+* Interrupt injection. In fully enlightened mode, a malicious hypervisor
+  could inject interrupts into the guest OS at times that violate x86/x64
+  architectural rules. For full protection, the guest OS should include
+  enlightenments that use the interrupt injection management features provided
+  by CoCo-capable processors. In paravisor mode, the paravisor mediates
+  interrupt injection into the guest OS, and ensures that the guest OS only
+  sees interrupts that are "legal". The paravisor uses the interrupt injection
+  management features provided by the CoCo-capable physical processor, thereby
+  masking these complexities from the guest OS.
+
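The device emulation item describes a callback that tells the ioremap path whether an MMIO address belongs to a paravisor-emulated device and so must be mapped encrypted. The user-space sketch below models that decision. The base addresses are the conventional x86 defaults for the IO-APIC and TPM and are assumed here for illustration, as is the one-page size of each range. The `demo_*` names are hypothetical and are not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed addresses: conventional x86 defaults, chosen for illustration. */
#define DEMO_IOAPIC_BASE 0xfec00000ULL
#define DEMO_TPM_BASE    0xfed40000ULL
#define DEMO_PAGE_SIZE   0x1000ULL

struct demo_mmio_range {
	uint64_t base;
	uint64_t size;
};

/* Devices emulated by the paravisor, inside the guest context. */
static const struct demo_mmio_range demo_private_ranges[] = {
	{ DEMO_IOAPIC_BASE, DEMO_PAGE_SIZE },
	{ DEMO_TPM_BASE,    DEMO_PAGE_SIZE },
};

/* Return true if addr falls inside a paravisor-emulated device range. */
static bool demo_is_private_mmio(uint64_t addr)
{
	size_t count = sizeof(demo_private_ranges) / sizeof(demo_private_ranges[0]);
	size_t i;

	for (i = 0; i < count; i++) {
		const struct demo_mmio_range *r = &demo_private_ranges[i];

		assert(r->size != 0);
		/* Subtract first so the check cannot overflow near UINT64_MAX. */
		if (addr >= r->base && addr - r->base < r->size)
			return true;
	}
	return false;
}

enum demo_map_attr { DEMO_MAP_DECRYPTED, DEMO_MAP_ENCRYPTED };

/* MMIO is mapped decrypted unless a paravisor emulates the device. */
static enum demo_map_attr demo_ioremap_attr(bool paravisor_present, uint64_t addr)
{
	if (paravisor_present && demo_is_private_mmio(addr))
		return DEMO_MAP_ENCRYPTED;
	return DEMO_MAP_DECRYPTED;
}
```

Keeping the range check in a callback lets the generic mapping code stay unaware of which devices a particular paravisor emulates.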
+Hyper-V Hypercalls
+------------------
+When in fully-enlightened mode, hypercalls made by the Linux guest are routed
+directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
+normal hypercalls trap to the paravisor first, which may in turn invoke the
+hypervisor. But the paravisor is idiosyncratic in this regard, and a few
+hypercalls made by the Linux guest must always be routed directly to the
+hypervisor. These hypercall sites test for a paravisor being present, and use
+a special invocation sequence. See hv_post_message(), for example.
+
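The routing rule in the Hyper-V Hypercalls section reduces to a two-input decision, sketched below as a user-space model. The enum and function names are hypothetical. The sketch shows only the decision a call site makes, not the special invocation sequence itself, which the text does not describe.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_route {
	DEMO_ROUTE_HYPERVISOR,          /* normal invocation, no paravisor in the path */
	DEMO_ROUTE_PARAVISOR,           /* normal invocation, traps to the paravisor first */
	DEMO_ROUTE_HYPERVISOR_SPECIAL,  /* special sequence that reaches the hypervisor directly */
};

/*
 * must_reach_hypervisor marks the few hypercalls that must always go
 * straight to the hypervisor; which calls those are is not modeled here.
 */
static enum demo_route demo_route_hypercall(bool paravisor_present,
					    bool must_reach_hypervisor)
{
	if (!paravisor_present)
		return DEMO_ROUTE_HYPERVISOR;
	return must_reach_hypervisor ? DEMO_ROUTE_HYPERVISOR_SPECIAL
				     : DEMO_ROUTE_PARAVISOR;
}

/* Human-readable name, e.g. for a debug message. */
static const char *demo_route_name(enum demo_route route)
{
	switch (route) {
	case DEMO_ROUTE_HYPERVISOR:
		return "hypervisor";
	case DEMO_ROUTE_PARAVISOR:
		return "paravisor";
	case DEMO_ROUTE_HYPERVISOR_SPECIAL:
		return "hypervisor (special sequence)";
	}
	assert(!"unknown route");
	return "";
}
```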
+Guest communication with Hyper-V
+--------------------------------
+Separate from the generic Linux kernel handling of memory encryption in Linux
+CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
+shared between the Linux guest and the host. This shared memory must be
+marked decrypted to enable communication. Furthermore, since the threat model
+includes a compromised and potentially malicious host, the guest must guard
+against leaking any unintended data to the host through this shared memory.
+
+These Hyper-V and VMBus memory pages are marked as decrypted:
+
+* VMBus monitor pages
+
+* Synthetic interrupt controller (synic) related pages (unless supplied by
+  the paravisor)
+
+* Per-cpu hypercall input and output pages (unless running with a paravisor)
+
+* VMBus ring buffers. The direct mapping is marked decrypted in
+  __vmbus_establish_gpadl(). The secondary mapping created in
+  hv_ringbuffer_init() must also include the "decrypted" attribute.
+
+When the guest writes data to memory that is shared with the host, it must
+ensure that only the intended data is written. Padding or unused fields must
+be initialized to zeros before copying into the shared memory so that random
+kernel data is not inadvertently given to the host.
+
+Similarly, when the guest reads memory that is shared with the host, it must
+validate the data before acting on it so that a malicious host cannot induce
+the guest to expose unintended data. Doing such validation can be tricky
+because the host can modify the shared memory areas even while or after
+validation is performed. For messages passed from the host to the guest in a
+VMBus ring buffer, the length of the message is validated, and the message is
+copied into a temporary (encrypted) buffer for further validation and
+processing. The copying adds a small amount of overhead, but is the only way
+to protect against a malicious host. See hv_pkt_iter_first().
+
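The two rules for shared memory (zero everything before writing, and validate a private copy when reading) can be sketched in user-space C as follows. The message layout and all `demo_*` names are hypothetical, and plain buffers stand in for decrypted shared pages. The receive path here snapshots the whole fixed-size message before validating it, a simplification of the length-check-then-copy sequence the text describes for ring buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAYLOAD_MAX 16

/* Hypothetical message layout; most ABIs pad 3 bytes after 'type'. */
struct demo_msg {
	uint8_t  type;
	uint32_t len;
	uint8_t  payload[DEMO_PAYLOAD_MAX];
};

/*
 * Guest -> host: zero the whole message first so that padding and unused
 * payload bytes do not carry stale data into shared memory.
 */
static int demo_send(void *shared, size_t shared_size, uint8_t type,
		     const void *data, uint32_t len)
{
	struct demo_msg msg;

	assert(shared != NULL);
	if (shared_size < sizeof(msg) || len > DEMO_PAYLOAD_MAX)
		return -1;

	memset(&msg, 0, sizeof(msg));
	msg.type = type;
	msg.len = len;
	if (len)
		memcpy(msg.payload, data, len);
	memcpy(shared, &msg, sizeof(msg));
	return 0;
}

/*
 * Host -> guest: copy into private memory once, then validate and use only
 * the private copy, so later host writes to 'shared' cannot change the
 * values that were checked.
 */
static int demo_recv(const void *shared, size_t shared_size,
		     void *out, size_t out_size, uint32_t *out_len)
{
	struct demo_msg local;

	assert(shared != NULL && out != NULL && out_len != NULL);
	if (shared_size < sizeof(local))
		return -1;

	memcpy(&local, shared, sizeof(local));
	if (local.len > DEMO_PAYLOAD_MAX || local.len > out_size)
		return -1;

	memcpy(out, local.payload, local.len);
	*out_len = local.len;
	return 0;
}
```

The important property is that `demo_recv()` never reads `shared` again after the snapshot, so the length it validates is the same length it uses.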
+Many drivers for VMBus devices have been "hardened" by adding code to fully
+validate messages received over VMBus, instead of assuming that Hyper-V is
+acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
+vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
+CoCo VM have not been hardened, and they are not allowed to load in a CoCo
+VM. See vmbus_is_valid_offer() where such devices are excluded.
+
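The allow-list check for hardened drivers can be pictured as a table lookup, sketched here in user-space C. The table contents are illustrative: storvsc and netvsc are the devices the text names as used in CoCo VMs, and "demo-unhardened" is a made-up entry. The sketch matches by name for readability, whereas the kernel table identifies devices by GUID. Rejecting devices that are absent from the table is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_dev {
	const char *name;
	bool allowed_in_isolated;
};

/* Illustrative table; "demo-unhardened" is a hypothetical device. */
static const struct demo_dev demo_devs[] = {
	{ "storvsc",         true  },
	{ "netvsc",          true  },
	{ "demo-unhardened", false },
};

/* Decide whether a device offer should be accepted by the guest. */
static bool demo_offer_is_valid(const char *name, bool is_isolated_vm)
{
	size_t count = sizeof(demo_devs) / sizeof(demo_devs[0]);
	size_t i;

	assert(name != NULL);

	/* Outside a CoCo VM every offer is accepted. */
	if (!is_isolated_vm)
		return true;

	for (i = 0; i < count; i++) {
		if (strcmp(demo_devs[i].name, name) == 0)
			return demo_devs[i].allowed_in_isolated;
	}

	/* Assumption: devices missing from the table are rejected. */
	return false;
}
```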
+Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
+storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
+Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
+memory is done implicitly. netvsc has two modes for data transfers. The first
+mode goes through send and receive buffer space that is explicitly allocated
+by the netvsc driver, and is used for most smaller packets. These send and
+receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
+the netvsc driver explicitly copies packets to/from these buffers, the
+equivalent of bounce buffering between encrypted and decrypted memory is
+already part of the data path. The second mode uses the normal Linux kernel
+DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
+storvsc.
+
+Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
+Linux PCI device drivers access PCI config space using standard APIs provided
+by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
+space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
+encryption prevents Hyper-V from reading the guest instruction stream to
+emulate the access. So in a CoCo VM, these functions must make a hypercall
+with arguments explicitly describing the access. See
+_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
+"use_calls" flag indicating to use hypercalls.
+
+load_unaligned_zeropad()
+------------------------
+When transitioning memory between encrypted and decrypted, the caller of
+set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
+the memory isn't in use and isn't referenced while the transition is in
+progress. The transition has multiple steps, and includes interaction with
+the Hyper-V host. The memory is in an inconsistent state until all steps are
+complete. A reference while the state is inconsistent could result in an
+exception that can't be cleanly fixed up.
+
+However, the kernel load_unaligned_zeropad() mechanism may make stray
+references that can't be prevented by the caller of set_memory_encrypted() or
+set_memory_decrypted(), so there's specific code in the #VC or #VE exception
+handler to fix up this case. But a CoCo VM running on Hyper-V may be
+configured to run with a paravisor, with the #VC or #VE exception routed to
+the paravisor. There's no architectural way to forward the exceptions back to
+the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
+in the #VC/#VE handlers doesn't run.
+
+To avoid this problem, the Hyper-V specific functions for notifying the
+hypervisor of the transition mark pages as "not present" while a transition
+is in progress. If load_unaligned_zeropad() causes a stray reference, a
+normal page fault is generated instead of #VC or #VE, and the page-fault-
+based handlers for load_unaligned_zeropad() fix up the reference. When the
+encrypted/decrypted transition is complete, the pages are marked as "present"
+again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
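The "not present" technique from the load_unaligned_zeropad() section can be modeled in a few lines of user-space C. This is a toy state machine, not kernel code: the page structure, the probe hook, and all `demo_*` names are invented for illustration, and the host notification step is only a comment. It shows that a reference made inside the transition window sees an ordinary page fault rather than an inconsistent encryption state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_access { DEMO_ACCESS_OK, DEMO_ACCESS_PAGE_FAULT };

struct demo_page {
	bool present;
	bool encrypted;
};

/* Model a memory reference: a not-present page yields a page fault. */
static enum demo_access demo_touch(const struct demo_page *page)
{
	return page->present ? DEMO_ACCESS_OK : DEMO_ACCESS_PAGE_FAULT;
}

/* Hook that stands in for a stray reference during the transition. */
typedef void (*demo_probe_fn)(const struct demo_page *page, void *ctx);

/* Probe that records what a reference would observe; ctx is an enum demo_access. */
static void demo_record_touch(const struct demo_page *page, void *ctx)
{
	*(enum demo_access *)ctx = demo_touch(page);
}

/* Change the encryption state while the page is hidden from references. */
static void demo_set_encrypted(struct demo_page *page, bool encrypted,
			       demo_probe_fn probe, void *ctx)
{
	assert(page != NULL);

	page->present = false;          /* 1. mark the page not present */
	if (probe)
		probe(page, ctx);       /*    a stray reference lands here */
	/* 2. the host would be notified of the transition at this point */
	page->encrypted = encrypted;    /* 3. switch the encryption attribute */
	page->present = true;           /* 4. mark the page present again */
}
```

Passing `demo_record_touch` as the probe captures what a stray reference would see mid-transition, which in this model is always a page fault.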

Documentation/virt/hyperv/index.rst

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ Hyper-V Enlightenments
    vmbus
    clocks
    vpci
+   coco

arch/x86/hyperv/hv_init.c

Lines changed: 1 addition & 4 deletions
@@ -35,7 +35,6 @@
 #include <clocksource/hyperv_timer.h>
 #include <linux/highmem.h>
 
-int hyperv_init_cpuhp;
 u64 hv_current_partition_id = ~0ull;
 EXPORT_SYMBOL_GPL(hv_current_partition_id);
 
@@ -607,8 +606,6 @@ void __init hyperv_init(void)
 
 	register_syscore_ops(&hv_syscore_ops);
 
-	hyperv_init_cpuhp = cpuhp;
-
 	if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_ACCESS_PARTITION_ID)
 		hv_get_partition_id();
 
@@ -637,7 +634,7 @@ void __init hyperv_init(void)
 clean_guest_os_id:
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
-	cpuhp_remove_state(cpuhp);
+	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
 free_ghcb_page:
 	free_percpu(hv_ghcb_pg);
 free_vp_assist_page:

arch/x86/include/asm/mshyperv.h

Lines changed: 0 additions & 1 deletion
@@ -40,7 +40,6 @@ static inline unsigned char hv_get_nmi_reason(void)
 }
 
 #if IS_ENABLED(CONFIG_HYPERV)
-extern int hyperv_init_cpuhp;
 extern bool hyperv_paravisor_present;
 
 extern void *hv_hypercall_pg;

arch/x86/kernel/cpu/mshyperv.c

Lines changed: 18 additions & 3 deletions
@@ -199,8 +199,8 @@ static void hv_machine_shutdown(void)
 	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
 	 * corrupts the old VP Assist Pages and can crash the kexec kernel.
 	 */
-	if (kexec_in_progress && hyperv_init_cpuhp > 0)
-		cpuhp_remove_state(hyperv_init_cpuhp);
+	if (kexec_in_progress)
+		cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
 
 	/* The function calls stop_other_cpus(). */
 	native_machine_shutdown();
@@ -424,6 +424,7 @@ static void __init ms_hyperv_init_platform(void)
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
 		x86_platform.calibrate_tsc = hv_get_tsc_khz;
 		x86_platform.calibrate_cpu = hv_get_tsc_khz;
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	}
 
 	if (ms_hyperv.priv_high & HV_ISOLATION) {
@@ -449,9 +450,23 @@ static void __init ms_hyperv_init_platform(void)
 			ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
 
 			if (!ms_hyperv.paravisor_present) {
-				/* To be supported: more work is required. */
+				/*
+				 * Mark the Hyper-V TSC page feature as disabled
+				 * in a TDX VM without paravisor so that the
+				 * Invariant TSC, which is a better clocksource
+				 * anyway, is used instead.
+				 */
 				ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
 
+				/*
+				 * The Invariant TSC is expected to be available
+				 * in a TDX VM without paravisor, but if not,
+				 * print a warning message. The slower Hyper-V MSR-based
+				 * Ref Counter should end up being the clocksource.
+				 */
+				if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
+					pr_warn("Hyper-V: Invariant TSC is unavailable\n");
+
 				/* HV_MSR_CRASH_CTL is unsupported. */
 				ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
drivers/clocksource/hyperv_timer.c

Lines changed: 15 additions & 1 deletion
@@ -137,7 +137,21 @@ static int hv_stimer_init(unsigned int cpu)
 	ce->name = "Hyper-V clockevent";
 	ce->features = CLOCK_EVT_FEAT_ONESHOT;
 	ce->cpumask = cpumask_of(cpu);
-	ce->rating = 1000;
+
+	/*
+	 * Lower the rating of the Hyper-V timer in a TDX VM without paravisor,
+	 * so the local APIC timer (lapic_clockevent) is the default timer in
+	 * such a VM. The Hyper-V timer is not preferred in such a VM because
+	 * it depends on the slow VM Reference Counter MSR (the Hyper-V TSC
+	 * page is not enabled in such a VM because the VM uses Invariant TSC
+	 * as a better clocksource and it's challenging to mark the Hyper-V
+	 * TSC page shared in very early boot).
+	 */
+	if (!ms_hyperv.paravisor_present && hv_isolation_type_tdx())
+		ce->rating = 90;
+	else
+		ce->rating = 1000;
+
 	ce->set_state_shutdown = hv_ce_shutdown;
 	ce->set_state_oneshot = hv_ce_set_oneshot;
 	ce->set_next_event = hv_ce_set_next_event;

drivers/hv/hv.c

Lines changed: 3 additions & 3 deletions
@@ -342,9 +342,6 @@ int hv_synic_init(unsigned int cpu)
 	return 0;
 }
 
-/*
- * hv_synic_cleanup - Cleanup routine for hv_synic_init().
- */
 void hv_synic_disable_regs(unsigned int cpu)
 {
 	struct hv_per_cpu_context *hv_cpu =
@@ -436,6 +433,9 @@ static bool hv_synic_event_pending(void)
 	return pending;
 }
 
+/*
+ * hv_synic_cleanup - Cleanup routine for hv_synic_init().
+ */
 int hv_synic_cleanup(unsigned int cpu)
 {
 	struct vmbus_channel *channel, *sc;

drivers/hv/hyperv_vmbus.h

Lines changed: 0 additions & 6 deletions
@@ -380,12 +380,6 @@ void hv_vss_deinit(void);
 int hv_vss_pre_suspend(void);
 int hv_vss_pre_resume(void);
 void hv_vss_onchannelcallback(void *context);
-
-int hv_fcopy_init(struct hv_util_service *srv);
-void hv_fcopy_deinit(void);
-int hv_fcopy_pre_suspend(void);
-int hv_fcopy_pre_resume(void);
-void hv_fcopy_onchannelcallback(void *context);
 void vmbus_initiate_unload(bool crash);
 
 static inline void hv_poll_channel(struct vmbus_channel *channel,
