Skip to content

Commit 13ed5c4

Browse files
committed
cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
Commit 6da8f9b ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") attempted to reduce the governor overhead in some cases by making it avoid obtaining the sleep length (the time till the next timer event) which may be costly. Among other things, after the above commit, tick_nohz_get_sleep_length() was not called any more when idle state 0 was to be returned, which turned out to be problematic and the previous behavior in that respect was restored by commit 4b20b07 ("cpuidle: teo: Don't count non- existent intercepts"). However, commit 6da8f9b also caused the governor to avoid calling tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling" one (that is, it is not really an idle state, but a loop continuously executed by the CPU) when the target residency of the idle state to be returned was low enough, so there was no practical need to refine the idle state selection in any way. This change was not removed by the other commit, so now on systems where idle state 0 is a "polling" one, tick_nohz_get_sleep_length() is called when idle state 0 is to be returned, but it is not called when a deeper idle state with sufficiently low target residency is to be returned. That is arguably confusing and inconsistent. Moreover, there is no specific reason why the behavior in question should depend on whether or not idle state 0 is a "polling" one. One way to address this would be to make the governor always call tick_nohz_get_sleep_length() to obtain the sleep length, but that would effectively mean reverting commit 6da8f9b and restoring the latency issue that was the reason for doing it. This approach is thus not particularly attractive. To address it differently, notice that if a CPU is woken up very often, this is not likely to be caused by timers in the first place (user space has a default timer slack of 50 us and there are relatively few timers with a deadline shorter than several microseconds in the kernel) and even if it were the case, the potential benefit from using a deep idle state would then be questionable for latency reasons. Therefore, if the majority of CPU wakeups occur within several microseconds, it can be assumed that all wakeups in that range are non-timer and the sleep length need not be determined. Accordingly, introduce a new metric for counting wakeups with the measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle state selection to skip the tick_nohz_get_sleep_length() invocation if idle state 0 has been selected or the target residency of the candidate idle state is below RESIDENCY_THRESHOLD_NS and the value of the new metric is at least 1/2 of the total event count. Since the above requires the measured idle duration to be determined every time, except for the cases when one of the safety nets has triggered in which the wakeup is counted as a hit in the deepest idle state idle residency range, update the handling of those cases to avoid skipping the idle duration computation when the CPU wakeup is "genuine". Signed-off-by: Rafael J. Wysocki <[email protected]> Link: https://patch.msgid.link/[email protected] Tested-by: Aboorva Devarajan <[email protected]> Tested-by: Christian Loehle <[email protected]> Reviewed-by: Christian Loehle <[email protected]> [ rjw: Renamed a struct field ] [ rjw: Fixed typo in the subject and one in a comment ] Signed-off-by: Rafael J. Wysocki <[email protected]>
1 parent d619b5c commit 13ed5c4

File tree

1 file changed

+36
-22
lines changed
  • drivers/cpuidle/governors

1 file changed

+36
-22
lines changed

drivers/cpuidle/governors/teo.c

Lines changed: 36 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -129,13 +129,15 @@ struct teo_bin {
129129
* @state_bins: Idle state data bins for this CPU.
130130
* @total: Grand total of the "intercepts" and "hits" metrics for all bins.
131131
* @tick_intercepts: "Intercepts" before TICK_NSEC.
132+
* @short_idles: Wakeups after short idle periods.
132133
*/
133134
struct teo_cpu {
134135
s64 time_span_ns;
135136
s64 sleep_length_ns;
136137
struct teo_bin state_bins[CPUIDLE_STATE_MAX];
137138
unsigned int total;
138139
unsigned int tick_intercepts;
140+
unsigned int short_idles;
139141
};
140142

141143
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -152,12 +154,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
152154
s64 target_residency_ns;
153155
u64 measured_ns;
154156

155-
if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
157+
cpu_data->short_idles -= cpu_data->short_idles >> DECAY_SHIFT;
158+
159+
if (cpu_data->time_span_ns < 0) {
156160
/*
157-
* This causes the wakeup to be counted as a hit regardless of
158-
* the real idle duration which doesn't need to be computed
159-
* because the wakeup has been close enough to an anticipated
160-
* timer.
161+
* If one of the safety nets has triggered, assume that this
162+
* might have been a long sleep.
161163
*/
162164
measured_ns = U64_MAX;
163165
} else {
@@ -177,10 +179,14 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
177179
* time, so take 1/2 of the exit latency as a very rough
178180
* approximation of the average of it.
179181
*/
180-
if (measured_ns >= lat_ns)
182+
if (measured_ns >= lat_ns) {
181183
measured_ns -= lat_ns / 2;
182-
else
184+
if (measured_ns < RESIDENCY_THRESHOLD_NS)
185+
cpu_data->short_idles += PULSE;
186+
} else {
183187
measured_ns /= 2;
188+
cpu_data->short_idles += PULSE;
189+
}
184190
}
185191

186192
cpu_data->total = 0;
@@ -419,27 +425,35 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
419425
if (idx > constraint_idx)
420426
idx = constraint_idx;
421427

422-
if (!idx) {
423-
/*
424-
* Query the sleep length to be able to count the wakeup as a
425-
* hit if it is caused by a timer.
426-
*/
427-
cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
428-
goto out_tick;
429-
}
430-
431428
/*
432-
* If state 0 is a polling one, check if the target residency of
433-
* the current candidate state is low enough and skip the timers
434-
* check in that case too.
429+
* If either the candidate state is state 0 or its target residency is
430+
* low enough, there is basically nothing more to do, but if the sleep
431+
* length is not updated, the subsequent wakeup will be counted as an
432+
* "intercept" which may be problematic in the cases when timer wakeups
433+
* are dominant. Namely, it may effectively prevent deeper idle states
434+
* from being selected at one point even if no imminent timers are
435+
* scheduled.
436+
*
437+
* However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
438+
* CPU are unlikely (user space has a default 50 us slack value for
439+
* hrtimers and there are relatively few timers with a lower deadline
440+
* value in the kernel), and even if they did happen, the potential
441+
* benefit from using a deep idle state in that case would be
442+
* questionable anyway for latency reasons. Thus if the measured idle
443+
* duration falls into that range in the majority of cases, assume
444+
* non-timer wakeups to be dominant and skip updating the sleep length
445+
* to reduce latency.
435446
*/
436-
if ((drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
437-
drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
447+
if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
448+
2 * cpu_data->short_idles >= cpu_data->total)
438449
goto out_tick;
439450

440451
duration_ns = tick_nohz_get_sleep_length(&delta_tick);
441452
cpu_data->sleep_length_ns = duration_ns;
442453

454+
if (!idx)
455+
goto out_tick;
456+
443457
/*
444458
* If the closest expected timer is before the target residency of the
445459
* candidate state, a shallower one needs to be found.
@@ -501,7 +515,7 @@ static void teo_reflect(struct cpuidle_device *dev, int state)
501515
if (dev->poll_time_limit ||
502516
(tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
503517
dev->poll_time_limit = false;
504-
cpu_data->time_span_ns = cpu_data->sleep_length_ns;
518+
cpu_data->time_span_ns = KTIME_MIN;
505519
} else {
506520
cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
507521
}

0 commit comments

Comments
 (0)