Commit 108ad09

Suleiman Souhlal authored and Peter Zijlstra committed
sched: Don't try to catch up excess steal time.
When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, this results in inaccurate run times for the future things using clock_task, in some situations, as they end up getting additional steal time that did not actually happen. This is because there is a window between reading the elapsed time in update_rq_clock() and sampling the steal time in update_rq_clock_task(). If the VCPU gets preempted between those two points, any additional steal time is accounted to the outgoing task even though the calculated delta did not actually contain any of that "stolen" time. When this race happens, we can end up with steal time that exceeds the calculated delta, and the previous code would try to catch up that excess steal time in future clock updates, which is given to the next, incoming task, even though it did not actually have any time stolen.

This behavior is particularly bad when steal time can be very long, which we've seen when trying to extend steal time to contain the duration that the host was suspended [0]. When this happens, clock_task stays frozen, during which the running task stays running for the whole duration, since its run time doesn't increase. However the race can happen even under normal operation.

Ideally we would read the elapsed cpu time and the steal time atomically, to prevent this race from happening in the first place, but doing so is non-trivial.

Since the time between those two points isn't otherwise accounted anywhere, neither to the outgoing task nor the incoming task (because the "end of outgoing task" and "start of incoming task" timestamps are the same), I would argue that the right thing to do is to simply drop any excess steal time, in order to prevent these issues.

[0] https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
1 parent 82f9cc0 commit 108ad09

1 file changed: +4 -2 lines changed
kernel/sched/core.c

Lines changed: 4 additions & 2 deletions
@@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-		steal = paravirt_steal_clock(cpu_of(rq));
+		u64 prev_steal;
+
+		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;

 		if (unlikely(steal > delta))
 			steal = delta;

-		rq->prev_steal_time_rq += steal;
+		rq->prev_steal_time_rq = prev_steal;
 		delta -= steal;
 	}
 #endif
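
To make the effect of the change easier to see, here is a minimal user-space sketch of the two accounting schemes. It is not kernel code: `struct sim_rq`, `update_old()`, `update_new()` and `run_scenario()` are hypothetical names invented for this illustration, the time units are arbitrary, and the steal clock reading is passed in as a parameter rather than read through `paravirt_steal_clock()`. The scenario assumes a single burst of 1000 units of steal time is observed during an update whose measured delta is only 100, followed by two more updates with no further steal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant runqueue fields (not struct rq). */
struct sim_rq {
	uint64_t prev_steal_time_rq;	/* steal clock value already accounted */
	uint64_t clock_task;		/* task clock, with steal time removed */
};

typedef void (*update_fn)(struct sim_rq *rq, int64_t delta, uint64_t steal_clock);

/* Old scheme: advance prev_steal_time_rq only by the clamped amount,
 * so any excess steal time is carried into later updates. */
static void update_old(struct sim_rq *rq, int64_t delta, uint64_t steal_clock)
{
	int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	assert(delta >= 0);
	if (steal > delta)
		steal = delta;

	rq->prev_steal_time_rq += (uint64_t)steal;
	rq->clock_task += (uint64_t)(delta - steal);
}

/* New scheme: remember the raw steal clock reading, so any excess
 * beyond delta is dropped instead of being caught up later. */
static void update_new(struct sim_rq *rq, int64_t delta, uint64_t steal_clock)
{
	uint64_t prev_steal = steal_clock;
	int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	assert(delta >= 0);
	if (steal > delta)
		steal = delta;

	rq->prev_steal_time_rq = prev_steal;
	rq->clock_task += (uint64_t)(delta - steal);
}

/* Three updates of delta = 100 each; the steal clock jumps to 1000
 * before the first update and does not move afterwards.
 * Returns the resulting clock_task. */
static uint64_t run_scenario(update_fn update)
{
	struct sim_rq rq = { 0, 0 };
	int i;

	for (i = 0; i < 3; i++)
		update(&rq, 100, 1000);

	return rq.clock_task;
}
```

Under these assumptions, `run_scenario(update_old)` returns 0, because the leftover steal time keeps absorbing every later delta and the simulated `clock_task` stays frozen, while `run_scenario(update_new)` returns 200, because only the first update is charged and the excess is dropped.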
