
Commit 46609ce

Qais Yousef authored and Peter Zijlstra committed
sched/uclamp: Protect uclamp fast path code with static key
There is a report that when uclamp is enabled, a netperf UDP test regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/[email protected]/

While investigating the root cause, there was no sign that the uclamp code is doing anything particularly expensive, but it could suffer from bad cache behavior under certain circumstances that are yet to be understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that by default will skip executing uclamp logic in the enqueue/dequeue_task() fast path until it's needed.

As soon as the user starts using util clamp by:

1. Changing the uclamp value of a task with sched_setattr()
2. Modifying the default sysctl_sched_util_clamp_{min, max}
3. Modifying the default cpu.uclamp.{min, max} value in cgroup

we flip the static key, now that the user has opted to use util clamp, effectively re-introducing uclamp logic in the enqueue/dequeue_task() fast path. The key stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that don't need it but still allow distros to ship their kernels with uclamp compiled in by default.

SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end up with unbalanced calls to uclamp_rq_dec_id() if we flip the key while a task is running in the rq. Since we know it is harmless, we just quietly return if we attempt a uclamp_rq_dec_id() when rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_used() helper which takes the static key into account to ensure RT boosting behavior is retained.

The following results demonstrate how this helps on a 2-socket Xeon E5 2x10-core system:

                      nouclamp                 uclamp          uclamp-static-key
Hmean  send-64       162.43 (   0.00%)     157.84 *  -2.82%*     163.39 *   0.59%*
Hmean  send-128      324.71 (   0.00%)     314.78 *  -3.06%*     326.18 *   0.45%*
Hmean  send-256      641.55 (   0.00%)     628.67 *  -2.01%*     648.12 *   1.02%*
Hmean  send-1024    2525.28 (   0.00%)    2448.26 *  -3.05%*    2543.73 *   0.73%*
Hmean  send-2048    4836.14 (   0.00%)    4712.08 *  -2.57%*    4867.69 *   0.65%*
Hmean  send-3312    7540.83 (   0.00%)    7425.45 *  -1.53%*    7621.06 *   1.06%*
Hmean  send-4096    9124.53 (   0.00%)    8948.82 *  -1.93%*    9276.25 *   1.66%*
Hmean  send-8192   15589.67 (   0.00%)   15486.35 *  -0.66%*   15819.98 *   1.48%*
Hmean  send-16384  26386.47 (   0.00%)   25752.25 *  -2.40%*   26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is disabled in the fast path:

    8.73%   -1.55%  [kernel.kallsyms]   [k] try_to_wake_up
    0.07%   +0.04%  [kernel.kallsyms]   [k] deactivate_task
    0.13%   -0.02%  [kernel.kallsyms]   [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled in the fast path:

    8.73%   -0.72%  [kernel.kallsyms]   [k] try_to_wake_up
    0.13%   +0.39%  [kernel.kallsyms]   [k] activate_task
    0.07%   +0.38%  [kernel.kallsyms]   [k] deactivate_task

Fixes: 69842cb ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reported-by: Mel Gorman <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
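For context, a minimal userspace sketch of opt-in path (1) from the commit message above: changing a task's own uclamp value with sched_setattr(). This is not part of the commit; there is no glibc wrapper, so the raw syscall is used, and the struct layout and flag values below are assumptions mirroring the uapi headers of this era that should be checked against include/uapi/linux/sched.h.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed values; verify against include/uapi/linux/sched.h on your tree. */
#define SCHED_FLAG_KEEP_ALL           0x18    /* KEEP_POLICY | KEEP_PARAMS */
#define SCHED_FLAG_UTIL_CLAMP_MIN     0x20

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
};

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_flags    = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN,
                .sched_util_min = 256,  /* ~25% of SCHED_CAPACITY_SCALE (1024) */
        };

        /* The first successful call of this kind flips sched_uclamp_used. */
        if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0))
                perror("sched_setattr");

        return 0;
}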
1 parent d81ae8a commit 46609ce

3 files changed: +119 -4 lines changed

kernel/sched/core.c

Lines changed: 73 additions & 1 deletion
@@ -796,6 +796,26 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+/*
+ * This static key is used to reduce the uclamp overhead in the fast path. It
+ * primarily disables the call to uclamp_rq_{inc, dec}() in
+ * enqueue/dequeue_task().
+ *
+ * This allows users to continue to enable uclamp in their kernel config with
+ * minimum uclamp overhead in the fast path.
+ *
+ * As soon as userspace modifies any of the uclamp knobs, the static key is
+ * enabled, since we have actual users that make use of uclamp
+ * functionality.
+ *
+ * The knobs that would enable this static key are:
+ *
+ *   * A task modifying its uclamp value with sched_setattr().
+ *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
+ *   * An admin modifying the cgroup cpu.uclamp.{min, max}
+ */
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -992,10 +1012,38 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
        lockdep_assert_held(&rq->lock);
 
+       /*
+        * If sched_uclamp_used was enabled after task @p was enqueued,
+        * we could end up with unbalanced call to uclamp_rq_dec_id().
+        *
+        * In this case the uc_se->active flag should be false since no uclamp
+        * accounting was performed at enqueue time and we can just return
+        * here.
+        *
+        * Need to be careful of the following enqueue/dequeue ordering
+        * problem too
+        *
+        *      enqueue(taskA)
+        *      // sched_uclamp_used gets enabled
+        *      enqueue(taskB)
+        *      dequeue(taskA)
+        *      // Must not decrement bucket->tasks here
+        *      dequeue(taskB)
+        *
+        * where we could end up with stale data in uc_se and
+        * bucket[uc_se->bucket_id].
+        *
+        * The following check here eliminates the possibility of such race.
+        */
+       if (unlikely(!uc_se->active))
+               return;
+
        bucket = &uc_rq->bucket[uc_se->bucket_id];
+
        SCHED_WARN_ON(!bucket->tasks);
        if (likely(bucket->tasks))
                bucket->tasks--;
+
        uc_se->active = false;
 
        /*
@@ -1023,6 +1071,15 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
        enum uclamp_id clamp_id;
 
+       /*
+        * Avoid any overhead until uclamp is actually used by the userspace.
+        *
+        * The condition is constructed such that a NOP is generated when
+        * sched_uclamp_used is disabled.
+        */
+       if (!static_branch_unlikely(&sched_uclamp_used))
+               return;
+
        if (unlikely(!p->sched_class->uclamp_enabled))
                return;
 
@@ -1038,6 +1095,15 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
        enum uclamp_id clamp_id;
 
+       /*
+        * Avoid any overhead until uclamp is actually used by the userspace.
+        *
+        * The condition is constructed such that a NOP is generated when
+        * sched_uclamp_used is disabled.
+        */
+       if (!static_branch_unlikely(&sched_uclamp_used))
+               return;
+
        if (unlikely(!p->sched_class->uclamp_enabled))
                return;
 
@@ -1146,8 +1212,10 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
                update_root_tg = true;
        }
 
-       if (update_root_tg)
+       if (update_root_tg) {
+               static_branch_enable(&sched_uclamp_used);
                uclamp_update_root_tg();
+       }
 
        /*
         * We update all RUNNABLE tasks only when task groups are in use.
@@ -1212,6 +1280,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
        if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
                return;
 
+       static_branch_enable(&sched_uclamp_used);
+
        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
                uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
                              attr->sched_util_min, true);
@@ -7436,6 +7506,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
        if (req.ret)
                return req.ret;
 
+       static_branch_enable(&sched_uclamp_used);
+
        mutex_lock(&uclamp_mutex);
        rcu_read_lock();
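For reference, a minimal sketch (not part of the patch) of the static-key pattern core.c now relies on. DEFINE_STATIC_KEY_FALSE(), static_branch_unlikely() and static_branch_enable() are the real <linux/jump_label.h> APIs; the key and function names are placeholders.

#include <linux/jump_label.h>

/* Starts disabled: the branch below compiles to a NOP. */
DEFINE_STATIC_KEY_FALSE(my_feature_used);

static void hot_path(void)
{
        /* Patched into a jump only once the key is enabled. */
        if (!static_branch_unlikely(&my_feature_used))
                return;

        /* ... accounting that is only worth paying for when in use ... */
}

static void user_opts_in(void)
{
        /* One-way flip, mirroring how sched_uclamp_used is never disabled. */
        static_branch_enable(&my_feature_used);
}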

kernel/sched/cpufreq_schedutil.c

Lines changed: 1 addition & 1 deletion
@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
        unsigned long dl_util, util, irq;
        struct rq *rq = cpu_rq(cpu);
 
-       if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+       if (!uclamp_is_used() &&
            type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
                return max;
        }
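To illustrate the intent of this one-liner: the RT "run at max frequency" decision now depends on runtime state rather than on the build configuration. The wrapper below is made up for illustration; only uclamp_is_used() and rt_rq_is_runnable() come from the kernel.

/* Illustrative only: should this RT-runnable CPU request max frequency
 * rather than a uclamp-restricted value? */
static bool rt_runs_at_max_freq(struct rq *rq)
{
        /* Pre-patch this was !IS_BUILTIN(CONFIG_UCLAMP_TASK): compile time. */
        return !uclamp_is_used() && rt_rq_is_runnable(&rq->rt);
}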

kernel/sched/sched.h

Lines changed: 45 additions & 2 deletions
@@ -878,6 +878,8 @@ struct uclamp_rq {
        unsigned int value;
        struct uclamp_bucket bucket[UCLAMP_BUCKETS];
 };
+
+DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
 /*
@@ -2365,12 +2367,35 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq: The rq to clamp against. Must not be NULL.
+ * @util: The util value to clamp.
+ * @p: The task to clamp against. Can be NULL if you want to clamp
+ *     against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
 static __always_inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
                                  struct task_struct *p)
 {
-       unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
-       unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+       unsigned long min_util;
+       unsigned long max_util;
+
+       if (!static_branch_likely(&sched_uclamp_used))
+               return util;
+
+       min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+       max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
 
        if (p) {
                min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
@@ -2387,13 +2412,31 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 
        return clamp(util, min_util, max_util);
 }
+
+/*
+ * When uclamp is compiled in, the aggregation at rq level is 'turned off'
+ * by default in the fast path and only gets turned on once userspace performs
+ * an operation that requires it.
+ *
+ * Returns true if userspace opted-in to use uclamp and aggregation at rq level
+ * hence is active.
+ */
+static inline bool uclamp_is_used(void)
+{
+       return static_branch_likely(&sched_uclamp_used);
+}
 #else /* CONFIG_UCLAMP_TASK */
 static inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
                                  struct task_struct *p)
 {
        return util;
 }
+
+static inline bool uclamp_is_used(void)
+{
+       return false;
+}
 #endif /* CONFIG_UCLAMP_TASK */
 
 #ifdef arch_scale_freq_capacity
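A hypothetical caller sketch for the reworked uclamp_rq_util_with(). The surrounding function is made up for illustration, and cpu_util_cfs() is assumed to be the usual sched.h helper of this era; treat both as assumptions.

static unsigned long freq_request_util(struct rq *rq)
{
        unsigned long util = cpu_util_cfs(rq);

        /*
         * With sched_uclamp_used disabled this returns util unchanged and
         * never touches rq->uclamp[], keeping the fast-path NOP intact.
         */
        return uclamp_rq_util_with(rq, util, NULL);
}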
