Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

torvalds · torvalds · commit fb92a1ffc121 · 2024-09-09T09:31:55.000-07:00
Pull hyperv fixes from Wei Liu:

 - Add a documentation overview of Confidential Computing VM support
   (Michael Kelley)

 - Use lapic timer in a TDX VM without paravisor (Dexuan Cui)

 - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
   (Michael Kelley)

 - Fix a kexec crash due to VP assist page corruption (Anirudh
   Rayabharam)

 - Python3 compatibility fix for lsvmbus (Anthony Nandaa)

 - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)

* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv: vmbus: Constify struct kobj_type and struct attribute_group
  tools: hv: rm .*.cmd when make clean
  x86/hyperv: fix kexec crash due to VP assist page corruption
  Drivers: hv: vmbus: Fix the misplaced function description
  tools: hv: lsvmbus: change shebang to use python3
  x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
  Documentation: hyperv: Add overview of Confidential Computing VM support
  clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
  Drivers: hv: Remove deprecated hv_fcopy declarations
diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst
@@ -0,0 +1,260 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Confidential Computing VMs
+==========================
+Hyper-V can create and run Linux guests that are Confidential Computing
+(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
+the confidentiality and integrity of data in the VM's memory, even in the
+face of a hypervisor/VMM that has been compromised and may behave maliciously.
+CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
+objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
+that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
+"isolation VMs".
+
+A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
+following:
+
+* Physical hardware with a processor that supports CoCo VMs
+
+* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
+
+* The VM runs a version of Linux that supports being a CoCo VM
+
+The physical hardware requirements are as follows:
+
+* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
+  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
+  VM on Hyper-V.
+
+* Intel processor with TDX
+
+To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
+when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
+or vice versa, after it is created.
+
+Operational Modes
+-----------------
+Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
+created and cannot be changed during the life of the VM.
+
+* Fully-enlightened mode. In this mode, the guest operating system is
+  enlightened to understand and manage all aspects of running as a CoCo VM.
+
+* Paravisor mode. In this mode, a paravisor layer between the guest and the
+  host provides some operations needed to run as a CoCo VM. The guest operating
+  system can have fewer CoCo enlightenments than is required in the
+  fully-enlightened case.
+
+Conceptually, fully-enlightened mode and paravisor mode may be treated as
+points on a spectrum spanning the degree of guest enlightenment needed to run
+as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
+implementation of paravisor mode is the other end of the spectrum, where all
+aspects of running as a CoCo VM are handled by the paravisor, and a normal
+guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
+can run successfully. However, the Hyper-V implementation of paravisor mode
+does not go this far, and is somewhere in the middle of the spectrum. Some
+aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
+must be enlightened for other aspects. Unfortunately, there is no
+standardized enumeration of feature/functions that might be provided in the
+paravisor, and there is no standardized mechanism for a guest OS to query the
+paravisor for the feature/functions it provides. The understanding of what
+the paravisor provides is hard-coded in the guest OS.
+
+Paravisor mode has similarities to the `Coconut project`_, which aims to provide
+a limited paravisor to provide services to the guest such as a virtual TPM.
+However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
+than is currently envisioned for Coconut, and so is further toward the "no
+guest enlightenments required" end of the spectrum.
+
+.. _Coconut project: https://github.com/coconut-svsm/svsm
+
+In the CoCo VM threat model, the paravisor is in the guest security domain
+and must be trusted by the guest OS. By implication, the hypervisor/VMM must
+protect itself against a potentially malicious paravisor just like it
+protects against a potentially malicious guest.
+
+The hardware architectural approach to fully-enlightened vs. paravisor mode
+varies depending on the underlying processor.
+
+* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
+  VMPL 0 and has full control of the guest context. In paravisor mode, the
+  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
+  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
+  Certain operations require the guest to invoke the paravisor. Furthermore, in
+  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
+  as defined by the SEV-SNP architecture. This mode simplifies guest management
+  of memory encryption when a paravisor is used.
+
+* With Intel TDX processor, in fully-enlightened mode the guest OS runs in an
+  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
+  L1 VM, and the guest OS runs in a nested L2 VM.
+
+Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
+MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
+whether a paravisor is being used. It is straightforward to build a single
+kernel image that can boot and run properly on either architecture, and in
+either mode.
+
+Paravisor Effects
+-----------------
+Running in paravisor mode affects the following areas of generic Linux kernel
+CoCo VM functionality:
+
+* Initial guest memory setup. When a new VM is created in paravisor mode, the
+  paravisor runs first and sets up the guest physical memory as encrypted. The
+  guest Linux does normal memory initialization, except for explicitly marking
+  appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
+  perform the early boot memory setup steps that are particularly tricky with
+  AMD SEV-SNP in fully-enlightened mode.
+
+* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
+  CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
+  respectively, and not the guest Linux. Consequently, these exception handlers
+  do not run in the guest Linux and are not a required enlightenment for a
+  Linux guest in paravisor mode.
+
+* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
+  guest indicating that the VM is operating with the respective hardware
+  support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
+  the paravisor filters out these flags and the guest Linux does not see them.
+  Throughout the Linux kernel, explicitly testing these flags has mostly been
+  eliminated in favor of the cc_platform_has() function, with the goal of
+  abstracting the differences between SEV-SNP and TDX. But the
+  cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
+  to selectively enable aspects of CoCo VM functionality even when the CPUID
+  flags are not set. The exception is early boot memory setup on SEV-SNP, which
+  tests the CPUID SEV-SNP flag. But not having the flag in Hyper-V paravisor
+  mode VM achieves the desired effect or not running SEV-SNP specific early
+  boot memory setup.
+
+* Device emulation. In paravisor mode, the Hyper-V paravisor provides
+  emulation of devices such as the IO-APIC and TPM. Because the emulation
+  happens in the paravisor in the guest context (instead of the hypervisor/VMM
+  context), MMIO accesses to these devices must be encrypted references instead
+  of the decrypted references that would be used in a fully-enlightened CoCo
+  VM. The __ioremap_caller() function has been enhanced to make a callback to
+  check whether a particular address range should be treated as encrypted
+  (private). See the "is_private_mmio" callback.
+
+* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
+  memory between encrypted and decrypted requires coordinating with the
+  hypervisor/VMM. This is done via callbacks invoked from
+  __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
+  TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
+  specific set of callbacks is used. These callbacks invoke the paravisor so
+  that the paravisor can coordinate the transitions and inform the hypervisor
+  as necessary. See hv_vtom_init() where these callback are set up.
+
+* Interrupt injection. In fully enlightened mode, a malicious hypervisor
+  could inject interrupts into the guest OS at times that violate x86/x64
+  architectural rules. For full protection, the guest OS should include
+  enlightenments that use the interrupt injection management features provided
+  by CoCo-capable processors. In paravisor mode, the paravisor mediates
+  interrupt injection into the guest OS, and ensures that the guest OS only
+  sees interrupts that are "legal". The paravisor uses the interrupt injection
+  management features provided by the CoCo-capable physical processor, thereby
+  masking these complexities from the guest OS.
+
+Hyper-V Hypercalls
+------------------
+When in fully-enlightened mode, hypercalls made by the Linux guest are routed
+directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
+normal hypercalls trap to the paravisor first, which may in turn invoke the
+hypervisor. But the paravisor is idiosyncratic in this regard, and a few
+hypercalls made by the Linux guest must always be routed directly to the
+hypervisor. These hypercall sites test for a paravisor being present, and use
+a special invocation sequence. See hv_post_message(), for example.
+
+Guest communication with Hyper-V
+--------------------------------
+Separate from the generic Linux kernel handling of memory encryption in Linux
+CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
+shared between the Linux guest and the host. This shared memory must be
+marked decrypted to enable communication. Furthermore, since the threat model
+includes a compromised and potentially malicious host, the guest must guard
+against leaking any unintended data to the host through this shared memory.
+
+These Hyper-V and VMBus memory pages are marked as decrypted:
+
+* VMBus monitor pages
+
+* Synthetic interrupt controller (synic) related pages (unless supplied by
+  the paravisor)
+
+* Per-cpu hypercall input and output pages (unless running with a paravisor)
+
+* VMBus ring buffers. The direct mapping is marked decrypted in
+  __vmbus_establish_gpadl(). The secondary mapping created in
+  hv_ringbuffer_init() must also include the "decrypted" attribute.
+
+When the guest writes data to memory that is shared with the host, it must
+ensure that only the intended data is written. Padding or unused fields must
+be initialized to zeros before copying into the shared memory so that random
+kernel data is not inadvertently given to the host.
+
+Similarly, when the guest reads memory that is shared with the host, it must
+validate the data before acting on it so that a malicious host cannot induce
+the guest to expose unintended data. Doing such validation can be tricky
+because the host can modify the shared memory areas even while or after
+validation is performed. For messages passed from the host to the guest in a
+VMBus ring buffer, the length of the message is validated, and the message is
+copied into a temporary (encrypted) buffer for further validation and
+processing. The copying adds a small amount of overhead, but is the only way
+to protect against a malicious host. See hv_pkt_iter_first().
+
+Many drivers for VMBus devices have been "hardened" by adding code to fully
+validate messages received over VMBus, instead of assuming that Hyper-V is
+acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
+vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
+CoCo VM have not been hardened, and they are not allowed to load in a CoCo
+VM. See vmbus_is_valid_offer() where such devices are excluded.
+
+Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
+storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
+Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
+memory is done implicitly. netvsc has two modes for data transfers. The first
+mode goes through send and receive buffer space that is explicitly allocated
+by the netvsc driver, and is used for most smaller packets. These send and
+receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
+the netvsc driver explicitly copies packets to/from these buffers, the
+equivalent of bounce buffering between encrypted and decrypted memory is
+already part of the data path. The second mode uses the normal Linux kernel
+DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
+storvsc.
+
+Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
+Linux PCI device drivers access PCI config space using standard APIs provided
+by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
+space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
+encryption prevents Hyper-V from reading the guest instruction stream to
+emulate the access. So in a CoCo VM, these functions must make a hypercall
+with arguments explicitly describing the access. See
+_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
+"use_calls" flag indicating to use hypercalls.
+
+load_unaligned_zeropad()
+------------------------
+When transitioning memory between encrypted and decrypted, the caller of
+set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
+the memory isn't in use and isn't referenced while the transition is in
+progress. The transition has multiple steps, and includes interaction with
+the Hyper-V host. The memory is in an inconsistent state until all steps are
+complete. A reference while the state is inconsistent could result in an
+exception that can't be cleanly fixed up.
+
+However, the kernel load_unaligned_zeropad() mechanism may make stray
+references that can't be prevented by the caller of set_memory_encrypted() or
+set_memory_decrypted(), so there's specific code in the #VC or #VE exception
+handler to fixup this case. But a CoCo VM running on Hyper-V may be
+configured to run with a paravisor, with the #VC or #VE exception routed to
+the paravisor. There's no architectural way to forward the exceptions back to
+the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
+in the #VC/#VE handlers doesn't run.
+
+To avoid this problem, the Hyper-V specific functions for notifying the
+hypervisor of the transition mark pages as "not present" while a transition
+is in progress. If load_unaligned_zeropad() causes a stray reference, a
+normal page fault is generated instead of #VC or #VE, and the page-fault-
+based handlers for load_unaligned_zeropad() fixup the reference. When the
+encrypted/decrypted transition is complete, the pages are marked as "present"
+again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
@@ -11,3 +11,4 @@ Hyper-V Enlightenments
    vmbus
    clocks
    vpci
+   coco
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
@@ -35,7 +35,6 @@
 #include <clocksource/hyperv_timer.h>
 #include <linux/highmem.h>
 
-int hyperv_init_cpuhp;
 u64 hv_current_partition_id = ~0ull;
 EXPORT_SYMBOL_GPL(hv_current_partition_id);
 
@@ -607,8 +606,6 @@ void __init hyperv_init(void)
 
 	register_syscore_ops(&hv_syscore_ops);
 
-	hyperv_init_cpuhp = cpuhp;
-
 	if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_ACCESS_PARTITION_ID)
 		hv_get_partition_id();
 
@@ -637,7 +634,7 @@ void __init hyperv_init(void)
 clean_guest_os_id:
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
-	cpuhp_remove_state(cpuhp);
+	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
 free_ghcb_page:
 	free_percpu(hv_ghcb_pg);
 free_vp_assist_page:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
@@ -40,7 +40,6 @@ static inline unsigned char hv_get_nmi_reason(void)
 }
 
 #if IS_ENABLED(CONFIG_HYPERV)
-extern int hyperv_init_cpuhp;
 extern bool hyperv_paravisor_present;
 
 extern void *hv_hypercall_pg;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
@@ -199,8 +199,8 @@ static void hv_machine_shutdown(void)
 	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
 	 * corrupts the old VP Assist Pages and can crash the kexec kernel.
 	 */
-	if (kexec_in_progress && hyperv_init_cpuhp > 0)
-		cpuhp_remove_state(hyperv_init_cpuhp);
+	if (kexec_in_progress)
+		cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
 
 	/* The function calls stop_other_cpus(). */
 	native_machine_shutdown();
@@ -424,6 +424,7 @@ static void __init ms_hyperv_init_platform(void)
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
 		x86_platform.calibrate_tsc = hv_get_tsc_khz;
 		x86_platform.calibrate_cpu = hv_get_tsc_khz;
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	}
 
 	if (ms_hyperv.priv_high & HV_ISOLATION) {
@@ -449,9 +450,23 @@ static void __init ms_hyperv_init_platform(void)
 			ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
 
 			if (!ms_hyperv.paravisor_present) {
-				/* To be supported: more work is required.  */
+				/*
+				 * Mark the Hyper-V TSC page feature as disabled
+				 * in a TDX VM without paravisor so that the
+				 * Invariant TSC, which is a better clocksource
+				 * anyway, is used instead.
+				 */
 				ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
 
+				/*
+				 * The Invariant TSC is expected to be available
+				 * in a TDX VM without paravisor, but if not,
+				 * print a warning message. The slower Hyper-V MSR-based
+				 * Ref Counter should end up being the clocksource.
+				 */
+				if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
+					pr_warn("Hyper-V: Invariant TSC is unavailable\n");
+
 				/* HV_MSR_CRASH_CTL is unsupported. */
 				ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
@@ -137,7 +137,21 @@ static int hv_stimer_init(unsigned int cpu)
 	ce->name = "Hyper-V clockevent";
 	ce->features = CLOCK_EVT_FEAT_ONESHOT;
 	ce->cpumask = cpumask_of(cpu);
-	ce->rating = 1000;
+
+	/*
+	 * Lower the rating of the Hyper-V timer in a TDX VM without paravisor,
+	 * so the local APIC timer (lapic_clockevent) is the default timer in
+	 * such a VM. The Hyper-V timer is not preferred in such a VM because
+	 * it depends on the slow VM Reference Counter MSR (the Hyper-V TSC
+	 * page is not enbled in such a VM because the VM uses Invariant TSC
+	 * as a better clocksource and it's challenging to mark the Hyper-V
+	 * TSC page shared in very early boot).
+	 */
+	if (!ms_hyperv.paravisor_present && hv_isolation_type_tdx())
+		ce->rating = 90;
+	else
+		ce->rating = 1000;
+
 	ce->set_state_shutdown = hv_ce_shutdown;
 	ce->set_state_oneshot = hv_ce_set_oneshot;
 	ce->set_next_event = hv_ce_set_next_event;
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
@@ -342,9 +342,6 @@ int hv_synic_init(unsigned int cpu)
 	return 0;
 }
 
-/*
- * hv_synic_cleanup - Cleanup routine for hv_synic_init().
- */
 void hv_synic_disable_regs(unsigned int cpu)
 {
 	struct hv_per_cpu_context *hv_cpu =
@@ -436,6 +433,9 @@ static bool hv_synic_event_pending(void)
 	return pending;
 }
 
+/*
+ * hv_synic_cleanup - Cleanup routine for hv_synic_init().
+ */
 int hv_synic_cleanup(unsigned int cpu)
 {
 	struct vmbus_channel *channel, *sc;
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
@@ -380,12 +380,6 @@ void hv_vss_deinit(void);
 int hv_vss_pre_suspend(void);
 int hv_vss_pre_resume(void);
 void hv_vss_onchannelcallback(void *context);
-
-int hv_fcopy_init(struct hv_util_service *srv);
-void hv_fcopy_deinit(void);
-int hv_fcopy_pre_suspend(void);
-int hv_fcopy_pre_resume(void);
-void hv_fcopy_onchannelcallback(void *context);
 void vmbus_initiate_unload(bool crash);
 
 static inline void hv_poll_channel(struct vmbus_channel *channel,
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
diff --git a/tools/hv/Makefile b/tools/hv/Makefile
diff --git a/tools/hv/lsvmbus b/tools/hv/lsvmbus

-Original file line number
+Diff line change
    vmbus
    clocks
    vpci
 +   coco
Original file line number	Diff line number	Diff line change
`@@ -40,7 +40,6 @@ static inline unsigned char hv_get_nmi_reason(void)`
`40`	`40`	`}`
`41`	`41`
`42`	`42`	`#if IS_ENABLED(CONFIG_HYPERV)`
`43`		`-extern int hyperv_init_cpuhp;`
`44`	`43`	`extern bool hyperv_paravisor_present;`
`45`	`44`
`46`	`45`	`extern void *hv_hypercall_pg;`
Original file line number	Diff line number	Diff line change
`@@ -342,9 +342,6 @@ int hv_synic_init(unsigned int cpu)`
`342`	`342`	`return 0;`
`343`	`343`	`}`
`344`	`344`
`345`		`-/*`
`346`		`- * hv_synic_cleanup - Cleanup routine for hv_synic_init().`
`347`		`- */`
`348`	`345`	`void hv_synic_disable_regs(unsigned int cpu)`
`349`	`346`	`{`
`350`	`347`	`struct hv_per_cpu_context *hv_cpu =`
`@@ -436,6 +433,9 @@ static bool hv_synic_event_pending(void)`
`436`	`433`	`return pending;`
`437`	`434`	`}`
`438`	`435`
	`436`	`+/*`
	`437`	`+ * hv_synic_cleanup - Cleanup routine for hv_synic_init().`
	`438`	`+ */`
`439`	`439`	`int hv_synic_cleanup(unsigned int cpu)`
`440`	`440`	`{`
`441`	`441`	`struct vmbus_channel channel, sc;`