Skip to content

Conversation

@vineethrp
Copy link

No description provided.


/* Hypercalls used only during pKVM initialization */
PKVM_HC(init_finalize)
PKVM_HC(init)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I saw the discussion in the google chat that, we may squash or reorder some patches in this serial. If so, it seems this patch can be a SQUASHME patch? If so, I can squash it to pkvm-v6.18 after the review.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I shall add SQUASHME details in the next iteration

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added SQUASHME, but not sure which patch to squash to(multiple patches touches the logic). So I could not add fixes tag to this patch. Sorry about this. Would you be able to figure out how to squash it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can handle that.

asm volatile (
"lgdt %0\n"
"lidt %1\n"
"ltr %w2\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: what about LDT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is - since we are in early boot and in the absence of userspave, LDT is not used and could be ignored. @cxdong, is this assumption true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, we do load it into the VMCS during deprivileging:

	/* Initialize LDTR */
	store_ldt(ldtr);
	vmcs_write16(GUEST_LDTR_SELECTOR, ldtr);
	vmcs_write32(GUEST_LDTR_AR_BYTES, 0x10000);
	vmcs_writel(GUEST_LDTR_BASE, 0x0);
	vmcs_write32(GUEST_LDTR_LIMIT, 0xffffffff);

so we assume it does matter?

Anyway, @cxdong probably knows more.

Side note: I'm curious where the above values 0x10000, 0x0, 0xffffffff come from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I have the same understanding, the ldtr is 0, and the AR/BASE/LIMIT could be a default value. E.g, 0x10000 for AR represents unusable. Maybe a more generic way is to read the AR/BASE/LIMIT from GDT if the ldtr is a non-zero value.

But maybe it makes more sense to restore the guest LDT selector as otherwise we will use the pKVM's LDT, although it is also 0.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I recalled that, IIUC, the GUEST_LDTR_SELECTOR setting (or any other GUEST_xxx_SELECTOR setting) basically has no effect when entering a VM, since the hardware will use the "cached" values programmed via GUEST_xxx_BASE, GUEST_xxx_LIMIT, GUEST_xxx_AR_BYTES instead.

So we are not really preserving the host's LDT selector during deprivileging, i.e. we are not really assuming that LDT matters at this point. If that's ok, then it might be ok to not do anything with it during reprivileging either...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra context. So, I think we can just ignore LDT for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is irrelavent with the "cached". The "cached" thing makes the host VM itself to use the LDT base/limit/ar programeed by the VMCS rather than the value loaded from its own GDT memory page, which is the unusable hidden part we programmed in the VMCS before deprivilege.

But here after a CPU is deprivileged and vmexit to the pKVM hypervisor to reprivilege, the LDT selector and hiddent part (base/limit/AR) switch to the pKVM hypervisor's value(in real, the pKVM hypervisor doesn't set LDT so its LDT selector is always 0 and LDT hidden part is also unusable). If we don't restore the host's LDT in the root mode, then the LDT will be remained as the root mode's value. We are good without restore the host's LDT selector because both sides are not using LDT so the LDT selector and base/limit/ar are the same for them. But if the pKVM hypervisor uses LDT after a vmexit, the root mode may use a different LDT hidden part with the host, then this seems will be problematic without restoring the host's LDT (which is unused) during reprivileging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the pKVM hypervisor uses LDT after a vmexit

Just wondering: do we foresee any use cases for using LDT in pKVM?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: do we foresee any use cases for using LDT in pKVM?

No, I think we don't foresee any use cases for that.

return 0;

repriv_cpus:
pkvm_host_reprivilege_cpus();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, how about temporarily making pKVM initialization always fail (and thus trigger reprivileging), until pvVMCS is ready? So that: 1. we can enable pKVM without breaking KVM functionality. 2. we can easily test (teamfood) reprivileging functionality just by enabling pKVM?

Ideally we could do that as a part of the patch e636d62 which enables deprivileging (which means that reprivileging would need to be implemented even before that patch), so that there is no single commit that makes KVM non-workable. Although we probably don't need to make the order of patches so perfect for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. I have added a separate "REVERTME" commit for this

wrmsrl(MSR_GS_BASE, hcs->gsbase);

wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl);
wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With such a testing hack:

--- a/arch/x86/kvm/vmx/pkvm_init.c
+++ b/arch/x86/kvm/vmx/pkvm_init.c
@@ -995,7 +995,7 @@ static __init void pkvm_host_deprivilege_cpu(void *data)
                goto vmxoff;
        }
 
-       ret = local_deprivilege_cpu();
+       ret = (cpu == 7) ? -EFAULT : local_deprivilege_cpu();
        if (ret) {
                pr_err("CPU%d deprivilege failed, ret %d\n", cpu, ret);
                goto vmxoff;

when running in qemu, it crashes during reprivileging:

[    0.891677] pkvm: Failed to deprivilege CPU7: smp_call 0, deprivilege: -14
[    0.893057] Oops: general protection fault, maybe for address 0x0: 0000 [#1] SMP NOPTI
[    0.894874] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.18.0+ #178 PREEMPT(voluntary) 
[    0.896629] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    0.896970] RIP: 0010:pkvm_vmx_reprivilege_cpu__pkvm+0x309/0x690
[    0.896970] Code: 0f 30 48 8b 05 68 53 bc 01 b9 d9 01 00 00 48 89 c2 48 c1 ea 20 0f 30 48 8b 05 5b 53 bc 01 b9 8f 03 00 00 48 89 c2 48 c1 ea 20 <0f> 30 48 8b 05 4e 53 bc 01 b9 74 01 00 00 48 89 c2 48 c1 ea 20 0f
[    0.896970] RSP: 0000:ffff88817a026f08 EFLAGS: 00010046
[    0.896970] RAX: 0000000000000000 RBX: ffff88817a02f120 RCX: 000000000000038f
[    0.896970] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88817a02f120
[    0.916488] RBP: ffff88817a026f18 R08: 0000000000000405 R09: 0000000000000000
[    0.916488] R10: 0000000000000000 R11: ffffc90000003ff8 R12: ffffffff835009bd
[    0.916488] R13: 0000000000000002 R14: ffff88817a02f000 R15: ffffffff83500970
[    0.916488] FS:  0000000000000000(0000) GS:ffff8881f5eaa000(0000) knlGS:0000000000000000
[    0.916488] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.916488] CR2: ffff88817ffff000 CR3: 000000000323a001 CR4: 0000000000772ef0
[    0.916488] PKRU: 55555554
[    0.916488] Call Trace:
[    0.916488] Modules linked in:
[    0.916488] ---[ end trace 0000000000000000 ]---
[    0.916488] RIP: 0010:pkvm_vmx_reprivilege_cpu__pkvm+0x309/0x690

Debugging shows that it crashes on writing MSR_CORE_PERF_GLOBAL_CTRL. Apparently this MSR is not implemented by qemu. And I assume the hypervisor's extable is not workable at this point, since pkvm_init() has not even started.

On real hardware this MSR it implemented, so this crash is not observed.

Anyway, this trick fixes it:

--- a/arch/x86/kvm/pkvm/vmx/host_repriv.c
+++ b/arch/x86/kvm/pkvm/vmx/host_repriv.c
@@ -107,7 +107,8 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
        wrmsrl(MSR_GS_BASE, hcs->gsbase);
 
        wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl);
-       wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
+       if (hcs->perf_global_ctrl)
+               wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
        wrmsrq(MSR_IA32_SYSENTER_CS, hcs->sysenter_cs);
        wrmsrq(MSR_IA32_SYSENTER_ESP, hcs->sysenter_esp);
        wrmsrq(MSR_IA32_SYSENTER_EIP, hcs->sysenter_eip);

I suppose this fix is correct, since pKVM ensures that while running in the pKVM hypervisor, this MSR value is always 0 (see Kevin's patch 1604e08), so if it is 0 in the host as well, we don't need to restore it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this, I haven't tested much on qemu after reviving my minipc for testing. Your fix looks good to me. The harder but correct way is to check pmu version id >= 2, but I guess this simple fix would be best as pkvm would not theoretically be supported on older hardware and this fix be just for qemu

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good cache. This reminds me that, the wrmsrl is going to deprecated, so it is better we can use wrmsrq instead.

And because MSR_CORE_PERF_GLOBAL_CTRL relies on some CPUID, and now the pKVM hypervisor can support the safe version of wrmsrq, this can also be fixed by using wrmsrq_safe rather than wrmsrq. To be noted, this should be done before we switching to the host's IDT and GS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be fixed by using wrmsrq_safe rather than wrmsrq. To be noted, this should be done before we switching to the host's IDT and GS.

The problem is that, for example, in this test with failing local_deprivilege_cpu(), reprivileging is done too early, when pKVM's exception fixup is not workable yet. I've just made an experiment: replaced wrmsrq with wrmsrq_safe, but it still crashes, the same way. I haven't investigated why exactly, but I suppose at least because pKVM hasn't called sort_extable() yet (since the host hasn't even made the init hypercall to pKVM yet)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, just noticed that we indeed write MSRs after switching the IDT, not before (which explains why this #GP call trace is printed by the host kernel, i.e. why it is handled by the host's exception handler).

So tried this:

--- a/arch/x86/kvm/pkvm/vmx/host_repriv.c
+++ b/arch/x86/kvm/pkvm/vmx/host_repriv.c
@@ -88,6 +88,8 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
        __pkvm_write_cr0(hcs->cr0);
        __pkvm_write_cr3(hcs->cr3);
 
+       wrmsrq_safe(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
+
        asm volatile (
                "lgdt %0\n"
                "lidt %1\n"
@@ -107,7 +109,6 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
        wrmsrl(MSR_GS_BASE, hcs->gsbase);
 
        wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl);
-       wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
        wrmsrq(MSR_IA32_SYSENTER_CS, hcs->sysenter_cs);
        wrmsrq(MSR_IA32_SYSENTER_ESP, hcs->sysenter_esp);
        wrmsrq(MSR_IA32_SYSENTER_EIP, hcs->sysenter_eip);

and this helps: kernel is booting successfully.

But still, is it really guaranteed to work if pKVM hasn't called sort_extable() yet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Although the real hardware has the MSR_CORE_PERF_GLOBAL_CTRL, a QEMU vm can be implemented not to support it. Created PR #55 to fix this.

Meanwhile, also moved the exception table sort from pkvm_init_finalize to the host side before deprivilege in PR #55. As excpetion fixup doesn't rely on the initialization done by the pkvm_init_finalize, sort the exception table can be done earlier. So move it to the deprivilege phase to make the exception fixup mechanism functional once the deprivileged is done. With this, we can also use wrmsrq_safe to fix this issue without checking hcs->perf_global_ctrl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks. PR #55 looks good to me but I see it is not actually working as it should on real HW on hybrid CPU (i.e., on any CPUs we actually care about), I've added a comment in the PR about that. But that's hopefully easy to fix.

With that fixed, I guess we should do wrmsrq_safe(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl) before switching IDT, that seems cleaner than my fix.

Maybe we can also move writing other MSRs (maybe except FS_BASE and GS_BASE) before switching IDT, together with MSR_CORE_PERF_GLOBAL_CTRL. Or maybe no need to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #55 has been merged and squashed to pkvm-v6.18 branch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks. I will rebase on that and fix the msr restore logic and update the PR soon

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all MSR restoration above GDT restoration and used wrmsrq_safe for MSR_CORE_PERF_GLOBAL_CTRL.

@cxdong cxdong force-pushed the pkvm-v6.18 branch 2 times, most recently from 407369e to 24beab8 Compare January 8, 2026 05:03
Vineeth Pillai (Google) added 3 commits January 8, 2026 10:57
init_finalize hypercall name is a slight misnomer - the hypercall
basically initializes pKVM hypervisor and not just the finalization
part. More over, we would be using the finalize terminology later when
we handle undoing the initialization and reprivileging cpus in the case
of initialization failure.

So rename the hypercall and related functions in pKVM and host.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Mechanism to undo the initialization logic and reprivilege the cpus back
to bare metal if pKVM initialization fails.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
These macros will be used by reprivilege code as well and hence, move it
to a common header file.

Fixes: 321aa36 (pKVM: VMX: Setup the RSP and RIP fields for the pKVM hypervisor)

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Reprivilege feature allows host to boot as bare metal in case of pKVM
initialization failure. All deprivileged cpus(VMX non-root mode) will be
transitioned back to VMX root mode and then to non-VMX mode so that host
can boot without pKVM.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Until pVMCS is ready, temporarily fail pKVM initialization so that the
reprivilege code gets more test coverage.

Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants