-
Notifications
You must be signed in to change notification settings - Fork 17
Reprivilege CPUs on pkvm initialization failure #53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: pkvm-v6.18
Are you sure you want to change the base?
Conversation
|
|
||
| /* Hypercalls used only during pKVM initialization */ | ||
| PKVM_HC(init_finalize) | ||
| PKVM_HC(init) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I saw the discussion in the google chat that, we may squash or reorder some patches in this serial. If so, it seems this patch can be a SQUASHME patch? If so, I can squash it to pkvm-v6.18 after the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I shall add SQUASHME details in the next iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added SQUASHME, but not sure which patch to squash to(multiple patches touches the logic). So I could not add fixes tag to this patch. Sorry about this. Would you be able to figure out how to squash it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can handle that.
b286e2c to
bc7a451
Compare
bc7a451 to
619f995
Compare
619f995 to
c7311f8
Compare
c7311f8 to
702a706
Compare
| asm volatile ( | ||
| "lgdt %0\n" | ||
| "lidt %1\n" | ||
| "ltr %w2\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: what about LDT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is - since we are in early boot and in the absence of userspave, LDT is not used and could be ignored. @cxdong, is this assumption true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, we do load it into the VMCS during deprivileging:
/* Initialize LDTR */
store_ldt(ldtr);
vmcs_write16(GUEST_LDTR_SELECTOR, ldtr);
vmcs_write32(GUEST_LDTR_AR_BYTES, 0x10000);
vmcs_writel(GUEST_LDTR_BASE, 0x0);
vmcs_write32(GUEST_LDTR_LIMIT, 0xffffffff);
so we assume it does matter?
Anyway, @cxdong probably knows more.
Side note: I'm curious where the above values 0x10000, 0x0, 0xffffffff come from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I have the same understanding, the ldtr is 0, and the AR/BASE/LIMIT could be a default value. E.g, 0x10000 for AR represents unusable. Maybe a more generic way is to read the AR/BASE/LIMIT from GDT if the ldtr is a non-zero value.
But maybe it makes more sense to restore the guest LDT selector as otherwise we will use the pKVM's LDT, although it is also 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I recalled that, IIUC, the GUEST_LDTR_SELECTOR setting (or any other GUEST_xxx_SELECTOR setting) basically has no effect when entering a VM, since the hardware will use the "cached" values programmed via GUEST_xxx_BASE, GUEST_xxx_LIMIT, GUEST_xxx_AR_BYTES instead.
So we are not really preserving the host's LDT selector during deprivileging, i.e. we are not really assuming that LDT matters at this point. If that's ok, then it might be ok to not do anything with it during reprivileging either...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra context. So, I think we can just ignore LDT for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this is irrelavent with the "cached". The "cached" thing makes the host VM itself to use the LDT base/limit/ar programeed by the VMCS rather than the value loaded from its own GDT memory page, which is the unusable hidden part we programmed in the VMCS before deprivilege.
But here after a CPU is deprivileged and vmexit to the pKVM hypervisor to reprivilege, the LDT selector and hiddent part (base/limit/AR) switch to the pKVM hypervisor's value(in real, the pKVM hypervisor doesn't set LDT so its LDT selector is always 0 and LDT hidden part is also unusable). If we don't restore the host's LDT in the root mode, then the LDT will be remained as the root mode's value. We are good without restore the host's LDT selector because both sides are not using LDT so the LDT selector and base/limit/ar are the same for them. But if the pKVM hypervisor uses LDT after a vmexit, the root mode may use a different LDT hidden part with the host, then this seems will be problematic without restoring the host's LDT (which is unused) during reprivileging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the pKVM hypervisor uses LDT after a vmexit
Just wondering: do we foresee any use cases for using LDT in pKVM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering: do we foresee any use cases for using LDT in pKVM?
No, I think we don't foresee any use cases for that.
| return 0; | ||
|
|
||
| repriv_cpus: | ||
| pkvm_host_reprivilege_cpus(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, how about temporarily making pKVM initialization always fail (and thus trigger reprivileging), until pvVMCS is ready? So that: 1. we can enable pKVM without breaking KVM functionality. 2. we can easily test (teamfood) reprivileging functionality just by enabling pKVM?
Ideally we could do that as a part of the patch e636d62 which enables deprivileging (which means that reprivileging would need to be implemented even before that patch), so that there is no single commit that makes KVM non-workable. Although we probably don't need to make the order of patches so perfect for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea. I have added a separate "REVERTME" commit for this
702a706 to
f2cfebc
Compare
arch/x86/kvm/pkvm/vmx/host_repriv.c
Outdated
| wrmsrl(MSR_GS_BASE, hcs->gsbase); | ||
|
|
||
| wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl); | ||
| wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With such a testing hack:
--- a/arch/x86/kvm/vmx/pkvm_init.c
+++ b/arch/x86/kvm/vmx/pkvm_init.c
@@ -995,7 +995,7 @@ static __init void pkvm_host_deprivilege_cpu(void *data)
goto vmxoff;
}
- ret = local_deprivilege_cpu();
+ ret = (cpu == 7) ? -EFAULT : local_deprivilege_cpu();
if (ret) {
pr_err("CPU%d deprivilege failed, ret %d\n", cpu, ret);
goto vmxoff;
when running in qemu, it crashes during reprivileging:
[ 0.891677] pkvm: Failed to deprivilege CPU7: smp_call 0, deprivilege: -14
[ 0.893057] Oops: general protection fault, maybe for address 0x0: 0000 [#1] SMP NOPTI
[ 0.894874] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.18.0+ #178 PREEMPT(voluntary)
[ 0.896629] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.896970] RIP: 0010:pkvm_vmx_reprivilege_cpu__pkvm+0x309/0x690
[ 0.896970] Code: 0f 30 48 8b 05 68 53 bc 01 b9 d9 01 00 00 48 89 c2 48 c1 ea 20 0f 30 48 8b 05 5b 53 bc 01 b9 8f 03 00 00 48 89 c2 48 c1 ea 20 <0f> 30 48 8b 05 4e 53 bc 01 b9 74 01 00 00 48 89 c2 48 c1 ea 20 0f
[ 0.896970] RSP: 0000:ffff88817a026f08 EFLAGS: 00010046
[ 0.896970] RAX: 0000000000000000 RBX: ffff88817a02f120 RCX: 000000000000038f
[ 0.896970] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88817a02f120
[ 0.916488] RBP: ffff88817a026f18 R08: 0000000000000405 R09: 0000000000000000
[ 0.916488] R10: 0000000000000000 R11: ffffc90000003ff8 R12: ffffffff835009bd
[ 0.916488] R13: 0000000000000002 R14: ffff88817a02f000 R15: ffffffff83500970
[ 0.916488] FS: 0000000000000000(0000) GS:ffff8881f5eaa000(0000) knlGS:0000000000000000
[ 0.916488] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.916488] CR2: ffff88817ffff000 CR3: 000000000323a001 CR4: 0000000000772ef0
[ 0.916488] PKRU: 55555554
[ 0.916488] Call Trace:
[ 0.916488] Modules linked in:
[ 0.916488] ---[ end trace 0000000000000000 ]---
[ 0.916488] RIP: 0010:pkvm_vmx_reprivilege_cpu__pkvm+0x309/0x690
Debugging shows that it crashes on writing MSR_CORE_PERF_GLOBAL_CTRL. Apparently this MSR is not implemented by qemu. And I assume the hypervisor's extable is not workable at this point, since pkvm_init() has not even started.
On real hardware this MSR it implemented, so this crash is not observed.
Anyway, this trick fixes it:
--- a/arch/x86/kvm/pkvm/vmx/host_repriv.c
+++ b/arch/x86/kvm/pkvm/vmx/host_repriv.c
@@ -107,7 +107,8 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
wrmsrl(MSR_GS_BASE, hcs->gsbase);
wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl);
- wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
+ if (hcs->perf_global_ctrl)
+ wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
wrmsrq(MSR_IA32_SYSENTER_CS, hcs->sysenter_cs);
wrmsrq(MSR_IA32_SYSENTER_ESP, hcs->sysenter_esp);
wrmsrq(MSR_IA32_SYSENTER_EIP, hcs->sysenter_eip);
I suppose this fix is correct, since pKVM ensures that while running in the pKVM hypervisor, this MSR value is always 0 (see Kevin's patch 1604e08), so if it is 0 in the host as well, we don't need to restore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about this, I haven't tested much on qemu after reviving my minipc for testing. Your fix looks good to me. The harder but correct way is to check pmu version id >= 2, but I guess this simple fix would be best as pkvm would not theoretically be supported on older hardware and this fix be just for qemu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good cache. This reminds me that, the wrmsrl is going to deprecated, so it is better we can use wrmsrq instead.
And because MSR_CORE_PERF_GLOBAL_CTRL relies on some CPUID, and now the pKVM hypervisor can support the safe version of wrmsrq, this can also be fixed by using wrmsrq_safe rather than wrmsrq. To be noted, this should be done before we switching to the host's IDT and GS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be fixed by using
wrmsrq_saferather thanwrmsrq. To be noted, this should be done before we switching to the host's IDT and GS.
The problem is that, for example, in this test with failing local_deprivilege_cpu(), reprivileging is done too early, when pKVM's exception fixup is not workable yet. I've just made an experiment: replaced wrmsrq with wrmsrq_safe, but it still crashes, the same way. I haven't investigated why exactly, but I suppose at least because pKVM hasn't called sort_extable() yet (since the host hasn't even made the init hypercall to pKVM yet)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, just noticed that we indeed write MSRs after switching the IDT, not before (which explains why this #GP call trace is printed by the host kernel, i.e. why it is handled by the host's exception handler).
So tried this:
--- a/arch/x86/kvm/pkvm/vmx/host_repriv.c
+++ b/arch/x86/kvm/pkvm/vmx/host_repriv.c
@@ -88,6 +88,8 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
__pkvm_write_cr0(hcs->cr0);
__pkvm_write_cr3(hcs->cr3);
+ wrmsrq_safe(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
+
asm volatile (
"lgdt %0\n"
"lidt %1\n"
@@ -107,7 +109,6 @@ static inline void restore_host_special_regs(struct host_cpu_state *hcs)
wrmsrl(MSR_GS_BASE, hcs->gsbase);
wrmsrq(MSR_IA32_DEBUGCTLMSR, hcs->debugctl);
- wrmsrq(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl);
wrmsrq(MSR_IA32_SYSENTER_CS, hcs->sysenter_cs);
wrmsrq(MSR_IA32_SYSENTER_ESP, hcs->sysenter_esp);
wrmsrq(MSR_IA32_SYSENTER_EIP, hcs->sysenter_eip);
and this helps: kernel is booting successfully.
But still, is it really guaranteed to work if pKVM hasn't called sort_extable() yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Although the real hardware has the MSR_CORE_PERF_GLOBAL_CTRL, a QEMU vm can be implemented not to support it. Created PR #55 to fix this.
Meanwhile, also moved the exception table sort from pkvm_init_finalize to the host side before deprivilege in PR #55. As excpetion fixup doesn't rely on the initialization done by the pkvm_init_finalize, sort the exception table can be done earlier. So move it to the deprivilege phase to make the exception fixup mechanism functional once the deprivileged is done. With this, we can also use wrmsrq_safe to fix this issue without checking hcs->perf_global_ctrl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks. PR #55 looks good to me but I see it is not actually working as it should on real HW on hybrid CPU (i.e., on any CPUs we actually care about), I've added a comment in the PR about that. But that's hopefully easy to fix.
With that fixed, I guess we should do wrmsrq_safe(MSR_CORE_PERF_GLOBAL_CTRL, hcs->perf_global_ctrl) before switching IDT, that seems cleaner than my fix.
Maybe we can also move writing other MSRs (maybe except FS_BASE and GS_BASE) before switching IDT, together with MSR_CORE_PERF_GLOBAL_CTRL. Or maybe no need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #55 has been merged and squashed to pkvm-v6.18 branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks. I will rebase on that and fix the msr restore logic and update the PR soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all MSR restoration above GDT restoration and used wrmsrq_safe for MSR_CORE_PERF_GLOBAL_CTRL.
407369e to
24beab8
Compare
init_finalize hypercall name is a slight misnomer - the hypercall basically initializes pKVM hypervisor and not just the finalization part. More over, we would be using the finalize terminology later when we handle undoing the initialization and reprivileging cpus in the case of initialization failure. So rename the hypercall and related functions in pKVM and host. Signed-off-by: Vineeth Pillai (Google) <[email protected]>
Mechanism to undo the initialization logic and reprivilege the cpus back to bare metal if pKVM initialization fails. Signed-off-by: Vineeth Pillai (Google) <[email protected]>
These macros will be used by reprivilege code as well and hence, move it to a common header file. Fixes: 321aa36 (pKVM: VMX: Setup the RSP and RIP fields for the pKVM hypervisor) Signed-off-by: Vineeth Pillai (Google) <[email protected]>
f2cfebc to
f3ea8f9
Compare
Reprivilege feature allows host to boot as bare metal in case of pKVM initialization failure. All deprivileged cpus(VMX non-root mode) will be transitioned back to VMX root mode and then to non-VMX mode so that host can boot without pKVM. Signed-off-by: Vineeth Pillai (Google) <[email protected]>
f3ea8f9 to
cfa2e47
Compare
Until pVMCS is ready, temporarily fail pKVM initialization so that the reprivilege code gets more test coverage. Signed-off-by: Vineeth Pillai (Google) <[email protected]>
cfa2e47 to
5ef129e
Compare
No description provided.