Commit f418371

changhuaixin authored and Peter Zijlstra committed
sched/fair: Introduce the burstable CFS controller
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty, so that they get throttled even when their average utilization is under quota; and they are latency sensitive at the same time, so throttling them is undesired.

We now borrow time against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime we'd have to run more than a second of program time, and obviously miss our deadline; but the next deadline will be further out still, there is never time to catch up: unbounded fail.

This work observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both of which specify a p(95) value; then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)*p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs.

At the same time, we can say that the worst-case deadline miss will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed the WCET).
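The two-task probabilities quoted above can be checked with a few lines of arithmetic (a sanity check, not part of the patch; the 0.95 is the p(95) from the example):

```python
# Two independent tasks, each sized so that it stays within quota
# with probability p(95) = 0.95 in any given period.
p_within = 0.95

# Chance both tasks are within their quota (everything is good).
both_ok = p_within * p_within        # 0.9025, i.e. 90.25%

# Chance both tasks exceed their quota simultaneously
# (guaranteed deadline fail).
both_over = (1 - p_within) ** 2      # 0.0025, i.e. 0.25%

print(f"both within quota: {both_ok:.2%}")
print(f"both overrun:      {both_over:.2%}")
```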
The benefit of burst is seen when testing with schbench. The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

  mkdir /sys/fs/cgroup/cpu/test
  echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
  echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
  echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

  ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, got long tail latency 6 times, and got throttled 8 times. Tail latencies are shown below, and this wasn't the worst case.

  Latency percentiles (usec)
          50.0000th: 19872
          75.0000th: 21344
          90.0000th: 22176
          95.0000th: 22496
          *99.0000th: 22752
          99.5000th: 22752
          99.9000th: 22752
          min=0, max=22727
  rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interference when using burst is valued by the possibility of missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited. More details are shown in:
https://lore.kernel.org/lkml/[email protected]/

Co-developed-by: Shanpei Chen <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ben Segall <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
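Why a bursty workload gets throttled despite a low average, and how burst fixes it, can be illustrated with a toy per-period accounting model (a sketch, not kernel code; only the refill cap of quota + burst mirrors the patched __refill_cfs_bandwidth_runtime()):

```python
# Toy model of per-period CFS bandwidth accounting (a sketch, not
# kernel code). Each period the group demands some runtime; demand
# beyond the runtime bucket counts as throttled time. The refill is
# capped at quota + burst, mirroring __refill_cfs_bandwidth_runtime().
def throttled_time(demands, quota, burst=0):
    runtime = quota                    # bucket starts with one period's quota
    throttled = 0
    for demand in demands:
        used = min(demand, runtime)
        throttled += demand - used     # unmet demand is throttled
        runtime -= used
        # per-period refill, capped so at most `burst` can accumulate
        runtime = min(runtime + quota, quota + burst)
    return throttled

# Bursty workload: average demand 80 <= quota 100, but spikes to 120.
spiky = [40, 120, 40, 120]
print(throttled_time(spiky, quota=100, burst=0))    # throttled despite low average
print(throttled_time(spiky, quota=100, burst=100))  # unused quota absorbs spikes
```

Without burst the bucket is clipped back to quota every period, so the underrun in the light periods is lost and each spike is throttled; with burst the underrun is carried forward and the spikes complete unthrottled.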
1 parent 0213b70

3 files changed, 73 insertions(+), 10 deletions(-)

kernel/sched/core.c

Lines changed: 62 additions & 6 deletions
@@ -9780,7 +9780,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+				u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -9810,6 +9811,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	if (quota != RUNTIME_INF && (burst > quota ||
+				     burst + quota > max_cfs_runtime))
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -9831,6 +9836,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -9864,17 +9870,18 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
 		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9892,15 +9899,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9913,6 +9921,30 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
+		return -EINVAL;
+
+	burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -9937,6 +9969,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, u64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -10089,6 +10133,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -10254,12 +10303,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
 	ret = cpu_period_quota_parse(buf, &period, &quota);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
@@ -10286,6 +10336,12 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+	{
+		.name = "max.burst",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
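The cftype entries added in core.c expose burst to userspace as cpu.cfs_burst_us (cgroup v1) and cpu.max.burst (cgroup v2). A minimal setup sketch, assuming a v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu and, for v2, a hypothetical "test" cgroup with the cpu controller enabled (a config fragment, requires root and a kernel with this patch):

```shell
# cgroup v1: allow up to 100ms of accumulated unused quota (burst)
# on top of the group's existing quota/period.
mkdir -p /sys/fs/cgroup/cpu/test
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

# cgroup v2: the same knob is the cpu.max.burst file
# (CFTYPE_NOT_ON_ROOT, so it exists only on non-root cgroups).
echo 100000 > /sys/fs/cgroup/test/cpu.max.burst
```

Note the validation added in tg_set_cfs_bandwidth(): a write fails with -EINVAL if burst exceeds quota or if burst + quota exceeds max_cfs_runtime.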

kernel/sched/fair.c

Lines changed: 10 additions & 4 deletions
@@ -4626,8 +4626,11 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	if (unlikely(cfs_b->quota == RUNTIME_INF))
+		return;
+
+	cfs_b->runtime += cfs_b->quota;
+	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4988,15 +4991,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	cfs_b->nr_periods += overrun;
 
+	/* Refill extra burst quota even if cfs_b->idle */
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 	/*
 	 * idle depends on !throttled (for the case of a large deficit), and if
 	 * we're going inactive then everything else can be deferred
 	 */
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
-
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
 		cfs_b->idle = 1;
@@ -5246,6 +5250,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	if (new < max_cfs_quota_period) {
 		cfs_b->period = ns_to_ktime(new);
 		cfs_b->quota *= 2;
+		cfs_b->burst *= 2;
 
 		pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5277,6 +5282,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);

kernel/sched/sched.h

Lines changed: 1 addition & 0 deletions
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
 	s64			hierarchical_quota;
 
 	u8			idle;
