
Commit e41bc79

roypatpb8o authored and committed
fix: Always restore MSR_IA32_TSC_DEADLINE after MSR_IA32_TSC
When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming.

One example of such an implicit dependency is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the restored value implies that the guest was waiting for the tsc_deadline timer to expire at the time it was paused for snapshotting. If so, it primes a timer (either hardware or software, depending on support) on the host to make sure the guest receives the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison yields a wrong result, as the deadline value is compared with something uninitialized. This can either cause KVM to conclude that the guest wasn't waiting for a timer expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future.

Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms, on the other hand, always fires (potentially firing immediately if the deadline value is in the past).

This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that need to be restored "as late as possible". This splitting happens at snapshot creation time, to keep it off the hot path.

Fixes #4099

Signed-off-by: Patrick Roy <[email protected]>
1 parent 9103fba

File tree

1 file changed (+68, -1 lines)

src/vmm/src/vstate/vcpu/x86_64.rs

Lines changed: 68 additions & 1 deletion
@@ -15,6 +15,7 @@ use kvm_bindings::{
 use kvm_ioctls::{VcpuExit, VcpuFd};
 use log::{error, warn};
 use serde::{Deserialize, Serialize};
+use utils::fam;

 use crate::arch::x86_64::interrupts;
 use crate::arch::x86_64::msr::{create_boot_msr_entries, MsrError};
@@ -33,6 +34,16 @@ use crate::vstate::vm::Vm;
 const TSC_KHZ_TOL_NUMERATOR: i64 = 250;
 const TSC_KHZ_TOL_DENOMINATOR: i64 = 1_000_000;

+/// A set of MSRs that should be restored separately after all other MSRs have already been restored
+const DEFERRED_MSRS: [u32; 1] = [
+    // MSR_IA32_TSC_DEADLINE must be restored after MSR_IA32_TSC, otherwise we risk "losing" timer
+    // interrupts across the snapshot restore boundary (due to KVM querying MSR_IA32_TSC upon
+    // writes to the TSC_DEADLINE MSR to determine whether it needs to prime a timer - if
+    // MSR_IA32_TSC is not initialized correctly, it can wrongly assume no timer needs to be
+    // primed, or the timer can be initialized with a wrong expiry).
+    MSR_IA32_TSC_DEADLINE,
+];
+
 /// Errors associated with the wrappers over KVM ioctls.
 #[derive(Debug, PartialEq, Eq, thiserror::Error, displaydoc::Display)]
 pub enum KvmVcpuError {
@@ -316,6 +327,39 @@ impl KvmVcpu {
         }
     }

+    /// Looks for MSRs from the [`DEFERRED_MSRS`] array and removes them from `msr_chunks`.
+    /// Returns a new [`Msrs`] object containing all the removed MSRs.
+    ///
+    /// We use this to capture some causal dependencies between MSRs where the relative order
+    /// of restoration matters (e.g. MSR_IA32_TSC must be restored before MSR_IA32_TSC_DEADLINE).
+    fn extract_deferred_msrs(msr_chunks: &mut [Msrs]) -> Result<Msrs, fam::Error> {
+        // Use 0 here as FamStructWrapper doesn't really give an equivalent of `Vec::with_capacity`,
+        // and if we specify something N != 0 here, then it will create a FamStructWrapper with N
+        // elements pre-allocated and zero'd out. Unless we then actually "fill" all those N values,
+        // KVM will later yell at us about invalid MSRs.
+        let mut deferred_msrs = Msrs::new(0)?;
+
+        for msrs in msr_chunks {
+            msrs.retain(|msr| {
+                if DEFERRED_MSRS.contains(&msr.index) {
+                    deferred_msrs
+                        .push(*msr)
+                        .inspect_err(|err| {
+                            error!(
+                                "Failed to move MSR {} into later chunk: {:?}",
+                                msr.index, err
+                            )
+                        })
+                        .is_err()
+                } else {
+                    true
+                }
+            });
+        }
+
+        Ok(deferred_msrs)
+    }
+
     /// Get MSR chunks for the given MSR index list.
     ///
     /// KVM only supports getting `KVM_MAX_MSR_ENTRIES` at a time, so we divide
@@ -338,7 +382,8 @@ impl KvmVcpu {
     ) -> Result<Vec<Msrs>, KvmVcpuError> {
         let num_chunks = msr_index_iter.len().div_ceil(KVM_MAX_MSR_ENTRIES);

-        let mut msr_chunks: Vec<Msrs> = Vec::with_capacity(num_chunks);
+        // + 1 for the chunk of deferred MSRs
+        let mut msr_chunks: Vec<Msrs> = Vec::with_capacity(num_chunks + 1);

         for _ in 0..num_chunks {
             let chunk_len = msr_index_iter.len().min(KVM_MAX_MSR_ENTRIES);
@@ -348,6 +393,9 @@ impl KvmVcpu {

         Self::fix_zero_tsc_deadline_msr(&mut msr_chunks);

+        let deferred = Self::extract_deferred_msrs(&mut msr_chunks)?;
+        msr_chunks.push(deferred);
+
         Ok(msr_chunks)
     }

@@ -1046,6 +1094,25 @@ mod tests {
         }
     }

+    #[test]
+    fn test_defer_msrs() {
+        let to_defer = DEFERRED_MSRS[0];
+
+        let mut msr_chunks = [msrs_from_entries(&[(to_defer, 0), (MSR_IA32_TSC, 1)])];
+
+        let deferred = KvmVcpu::extract_deferred_msrs(&mut msr_chunks).unwrap();
+
+        assert_eq!(deferred.as_slice().len(), 1, "did not correctly defer MSR");
+        assert_eq!(
+            msr_chunks[0].as_slice().len(),
+            1,
+            "deferred MSR not removed from chunk"
+        );
+
+        assert_eq!(deferred.as_slice()[0].index, to_defer);
+        assert_eq!(msr_chunks[0].as_slice()[0].index, MSR_IA32_TSC);
+    }
+
     #[test]
     fn test_fix_zero_tsc_deadline_msr_zero_same_chunk() {
         // Place both TSC and TSC_DEADLINE MSRs in the same chunk.
0 commit comments