
Commit bf328e2

Like Xu authored and sean-jc committed
KVM: x86: Don't sync user-written TSC against startup values
The legacy API for setting the TSC is fundamentally broken, and only allows userspace to set a TSC "now", without any way to account for time lost between the calculation of the value, and the kernel eventually handling the ioctl. To work around this, KVM has a hack which, if a TSC is set with a value which is within a second's worth of the last TSC "written" to any vCPU in the VM, assumes that userspace actually intended the two TSC values to be in sync and adjusts the newly-written TSC value accordingly. Thus, when a VMM restores a guest after suspend or migration using the legacy API, the TSCs aren't necessarily *right*, but at least they're in sync.

This trick falls down when restoring a guest which genuinely has been running for less time than the 1 second of imprecision KVM allows for in the legacy API. On *creation*, the first vCPU starts its TSC counting from zero, and the subsequent vCPUs synchronize to that. But then when the VMM tries to restore a vCPU's intended TSC, because the VM has been alive for less than 1 second and KVM's default TSC value for new vCPUs is '0', the intended TSC is within a second of the last "written" TSC and KVM incorrectly adjusts the intended TSC in an attempt to synchronize.

But further hacks can be piled onto KVM's existing hackish ABI, and declare that the *first* value written by *userspace* (on any vCPU) should not be subject to this "correction", i.e. KVM can assume that the first write from userspace is not an attempt to sync up with TSC values that only come from the kernel's default vCPU creation.

To that end: Add a flag, kvm->arch.user_set_tsc, protected by kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in the VM *has* been set by userspace, and make the 1-second slop hack only trigger if user_set_tsc is already set.

Note that userspace can explicitly request a *synchronization* of the TSC by writing zero.
For the purpose of user_set_tsc, an explicit synchronization counts as "setting" the TSC, i.e. if userspace then subsequently writes an explicit non-zero value which happens to be within 1 second of the previous value, the new value will be "corrected". This behavior is deliberate, as treating explicit synchronization as "setting" the TSC preserves KVM's existing behavior as much as possible (KVM always applied the 1-second "correction" regardless of whether the write came from userspace vs. the kernel).

Reported-by: Yong He <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Oliver Upton <[email protected]>
Original-by: Oliver Upton <[email protected]>
Original-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Tested-by: Yong He <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
1 parent 5914553 commit bf328e2

File tree

2 files changed: +25 −10 lines changed


arch/x86/include/asm/kvm_host.h

Lines changed: 1 addition & 0 deletions
@@ -1332,6 +1332,7 @@ struct kvm_arch {
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
+	bool user_set_tsc;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;

arch/x86/kvm/x86.c

Lines changed: 24 additions & 10 deletions
@@ -2709,8 +2709,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
@@ -2725,25 +2726,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	if (vcpu->arch.virtual_tsc_khz) {
 		if (data == 0) {
 			/*
-			 * detection of vcpu initialization -- need to sync
-			 * with other vCPUs. This particularly helps to keep
-			 * kvm_clock stable after CPU hotplug
+			 * Force synchronization when creating a vCPU, or when
+			 * userspace explicitly writes a zero value.
 			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: when a user-initiated TSC write has
+			 * a small delta (1 second) of virtual cycle time against the
+			 * previously set vCPU, we assume that they were intended to be
+			 * in sync and the delta was only due to the racy nature of the
+			 * legacy API.
+			 *
+			 * This trick falls down when restoring a guest which genuinely
+			 * has been running for less time than the 1 second of imprecision
+			 * which we allow for in the legacy API. In this case, the first
+			 * value written by userspace (on any vCPU) should not be subject
+			 * to this 'correction' to make it sync up with values that only
+			 * come from the kernel's default vCPU creation. Make the 1-second
+			 * slop hack only trigger if the user_set_tsc flag is already set.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (user_value)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation. We could let the
@@ -3870,7 +3883,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, &data);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5629,6 +5642,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 		ns = get_kvmclock_base_ns();
 
+		kvm->arch.user_set_tsc = true;
 		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
 		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
@@ -12055,7 +12069,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
