13 changes: 7 additions & 6 deletions Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` to which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.
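
As a rough illustration of the relationship described above (a standalone sketch, not the scheduler's code; the helper name and the pressure value are made up), the CFS-visible capacity is the original capacity minus whatever performance is currently lost:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024UL

    /* Illustrative only: "original capacity" is the class-agnostic maximum
     * (what arch_scale_cpu_capacity() returns), "capacity" is what is left
     * for CFS once pressure (e.g. time spent handling IRQs) is removed. */
    static unsigned long cfs_capacity(unsigned long original, unsigned long lost)
    {
        return lost >= original ? 0 : original - lost;
    }

    int main(void)
    {
        unsigned long original = SCHED_CAPACITY_SCALE;  /* biggest CPU in the system */
        unsigned long lost = 128;                       /* hypothetical IRQ pressure */

        printf("capacity = %lu\n", cfs_capacity(original, lost));
        return 0;
    }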

1.3 Platform examples
7 changes: 3 additions & 4 deletions Documentation/scheduler/schedutil.rst
@@ -90,17 +90,16 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.
highest. UTIL_EST filters to instantly increase and only decay on decrease.
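
With UTIL_EST_FASTUP gone, the asymmetric behaviour is simply what UTIL_EST does. A minimal userspace sketch of such a filter (illustrative; only the 1/4 weight implied by UTIL_EST_WEIGHT_SHIFT is taken from the kernel, the helper name is made up):

    #include <stdio.h>

    #define UTIL_EST_WEIGHT_SHIFT 2  /* EWMA weight of 1/4, as in the kernel */

    /* Jump straight up to a higher sample, decay slowly towards a lower one. */
    static unsigned int ewma_update(unsigned int ewma, unsigned int sample)
    {
        if (sample >= ewma)
            return sample;                                       /* instant increase */
        return ewma - ((ewma - sample) >> UTIL_EST_WEIGHT_SHIFT); /* slow decay */
    }

    int main(void)
    {
        unsigned int samples[] = { 600, 600, 100, 100, 100 };
        unsigned int ewma = 0;
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            ewma = ewma_update(ewma, samples[i]);
            printf("sample=%u ewma=%u\n", samples[i], ewma);
        }
        return 0;
    }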

A further runqueue wide sum (of runnable tasks) is maintained of:

7 changes: 3 additions & 4 deletions Documentation/translations/zh_CN/scheduler/schedutil.rst
@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========

由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。

为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
仅在利用率下降时衰减。
UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。

进一步,运行队列的(可运行任务的)利用率之和由下式计算:

1 change: 1 addition & 0 deletions arch/arm/include/asm/topology.h
@@ -13,6 +13,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#endif

/* Replace task scheduler's default cpu-invariant accounting */
1 change: 1 addition & 0 deletions arch/arm64/include/asm/topology.h
@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref

#ifdef CONFIG_ACPI_CPPC_LIB
#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
1 change: 1 addition & 0 deletions arch/riscv/include/asm/topology.h
@@ -9,6 +9,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref

/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
1 change: 1 addition & 0 deletions arch/x86/kernel/cpu/aperfmperf.c
@@ -346,6 +346,7 @@ static DECLARE_WORK(disable_freq_invariance_work,
disable_freq_invariance_workfn);

DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);

static void scale_freq_tick(u64 acnt, u64 mcnt)
{
42 changes: 20 additions & 22 deletions drivers/base/arch_topology.c
@@ -19,14 +19,16 @@
#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/units.h>

#define CREATE_TRACE_POINTS
#include <trace/events/thermal_pressure.h>

static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
static DEFINE_PER_CPU(u32, freq_factor) = 1;
DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);

static bool supports_scale_freq_counters(const struct cpumask *cpus)
{
@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
* operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to
* thermal capping. It might be also a boost frequency value, which is bigger
* than the internal 'freq_factor' max frequency. In such case the pressure
* value should simply be removed, since this is an indication that there is
* no thermal throttling. The @capped_freq must be provided in kHz.
* than the internal 'capacity_freq_ref' max frequency. In such case the
* pressure value should simply be removed, since this is an indication that
* there is no thermal throttling. The @capped_freq must be provided in kHz.
*/
void topology_update_thermal_pressure(const struct cpumask *cpus,
unsigned long capped_freq)
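
Since arch_scale_freq_ref() already reports the reference frequency in kHz, the old "convert @capped_freq to MHz" step below disappears. A hedged standalone sketch of the pressure arithmetic (the proportional scaling here is illustrative, not a copy of the kernel function):

    #include <stdio.h>

    /* Illustrative: pressure is the capacity currently lost to thermal capping. */
    static unsigned long thermal_pressure(unsigned long max_capacity,
                                          unsigned long max_freq_khz,
                                          unsigned long capped_freq_khz)
    {
        unsigned long long capacity;

        /* A boost value at or above the reference means no throttling at all. */
        if (capped_freq_khz >= max_freq_khz)
            return 0;

        capacity = (unsigned long long)capped_freq_khz * max_capacity / max_freq_khz;
        return max_capacity - (unsigned long)capacity;
    }

    int main(void)
    {
        /* Hypothetical CPU: capacity 1024, reference 2400000 kHz, capped at 1800000 kHz. */
        printf("pressure = %lu\n", thermal_pressure(1024, 2400000, 1800000));
        return 0;
    }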
@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,

cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu);
max_freq = per_cpu(freq_factor, cpu);

/* Convert to MHz scale which is used in 'freq_factor' */
capped_freq /= 1000;
max_freq = arch_scale_freq_ref(cpu);

/*
* Handle properly the boost frequencies, which should simply clean
@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)

capacity_scale = 1;
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity_scale = max(capacity, capacity_scale);
}

pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
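
The loop above now scales the DT raw capacity by the per-CPU capacity_freq_ref (kHz) instead of the old MHz freq_factor; only the ratio between CPUs matters, so the unit change cancels out in the normalization. A standalone sketch of the arithmetic with made-up big.LITTLE numbers:

    #include <stdio.h>

    #define SCHED_CAPACITY_SHIFT 10  /* full capacity == 1024 */

    int main(void)
    {
        /* Hypothetical values: capacity-dmips-mhz from DT and max frequency in kHz. */
        unsigned long long raw_capacity[2] = { 446, 1024 };        /* little, big */
        unsigned long long freq_ref[2]     = { 1800000, 2400000 };
        unsigned long long capacity, scale = 0;
        int cpu;

        /* Find the largest raw_capacity * freq_ref product... */
        for (cpu = 0; cpu < 2; cpu++) {
            capacity = raw_capacity[cpu] * freq_ref[cpu];
            if (capacity > scale)
                scale = capacity;
        }

        /* ...then normalize every CPU against it, as topology_normalize_cpu_scale() does. */
        for (cpu = 0; cpu < 2; cpu++) {
            capacity = (raw_capacity[cpu] * freq_ref[cpu]) << SCHED_CAPACITY_SHIFT;
            printf("cpu%d capacity = %llu\n", cpu, capacity / scale);
        }
        return 0;
    }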
@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
cpu_node, raw_capacity[cpu]);

/*
* Update freq_factor for calculating early boot cpu capacities.
* Update capacity_freq_ref for calculating early boot CPU capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same
* frequency (by keeping the initial freq_factor value).
* frequency (by keeping the initial capacity_freq_ref value).
*/
cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) {
per_cpu(freq_factor, cpu) =
clk_get_rate(cpu_clk) / 1000;
per_cpu(capacity_freq_ref, cpu) =
clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk);
}
} else {
@@ -398,9 +397,6 @@ init_cpu_capacity_callback(struct notifier_block *nb,
struct cpufreq_policy *policy = data;
int cpu;

if (!raw_capacity)
return 0;

if (val != CPUFREQ_CREATE_POLICY)
return 0;

@@ -411,12 +407,14 @@ init_cpu_capacity_callback(struct notifier_block *nb,
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);

for_each_cpu(cpu, policy->related_cpus)
per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
per_cpu(capacity_freq_ref, cpu) = policy->cpuinfo.max_freq;

if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
if (raw_capacity) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
}
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
}
@@ -436,7 +434,7 @@ static int __init register_cpufreq_notifier(void)
* On ACPI-based systems skip registering cpufreq notifier as cpufreq
* information is not needed for cpu capacity initialization.
*/
if (!acpi_disabled || !raw_capacity)
if (!acpi_disabled)
return -EINVAL;

if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
4 changes: 2 additions & 2 deletions drivers/cpufreq/cpufreq.c
@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,

arch_set_freq_scale(policy->related_cpus,
policy->cur,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));

spin_lock(&policy->transition_lock);
policy->transition_ongoing = false;
@@ -2188,7 +2188,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,

policy->cur = freq;
arch_set_freq_scale(policy->related_cpus, freq,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));
cpufreq_stats_record_transition(policy, freq);

if (trace_cpu_frequency_enabled()) {
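
Both arch_set_freq_scale() call sites above now pass arch_scale_freq_ref(policy->cpu) rather than policy->cpuinfo.max_freq, so frequency invariance is computed against the same per-CPU reference used elsewhere for capacity. Roughly, the resulting scale is cur/ref on the 1024 capacity scale; a small sketch with hypothetical frequencies:

    #include <stdio.h>

    #define SCHED_CAPACITY_SHIFT 10

    /* freq_scale ~ cur_freq / ref_freq, expressed on the 0..1024 capacity scale. */
    static unsigned long freq_scale(unsigned long cur_khz, unsigned long ref_khz)
    {
        return (unsigned long)(((unsigned long long)cur_khz << SCHED_CAPACITY_SHIFT) / ref_khz);
    }

    int main(void)
    {
        /* Hypothetical policy: reference 2400000 kHz, currently running at 1200000 kHz. */
        printf("freq_scale = %lu\n", freq_scale(1200000, 2400000)); /* 512 */
        return 0;
    }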
7 changes: 7 additions & 0 deletions include/linux/arch_topology.h
@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)

void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);

DECLARE_PER_CPU(unsigned long, capacity_freq_ref);

static inline unsigned long topology_get_freq_ref(int cpu)
{
return per_cpu(capacity_freq_ref, cpu);
}

DECLARE_PER_CPU(unsigned long, arch_freq_scale);

static inline unsigned long topology_get_freq_scale(int cpu)
1 change: 1 addition & 0 deletions include/linux/cpufreq.h
@@ -1220,6 +1220,7 @@ void arch_set_freq_scale(const struct cpumask *cpus,
{
}
#endif

/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
6 changes: 3 additions & 3 deletions include/linux/energy_model.h
@@ -224,7 +224,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
{
unsigned long freq, scale_cpu;
unsigned long freq, ref_freq, scale_cpu;
struct em_perf_state *ps;
int cpu;

@@ -241,11 +241,11 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
ps = &pd->table[pd->nr_perf_states - 1];
ref_freq = arch_scale_freq_ref(cpu);

max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ps->frequency, scale_cpu);
freq = map_util_freq(max_util, ref_freq, scale_cpu);

/*
* Find the lowest performance state of the Energy Model above the
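
The em_cpu_energy() change above maps utilization onto ref_freq = arch_scale_freq_ref(cpu) instead of the highest performance state, which matters when the EM table and the reference frequency do not coincide (e.g. boost states). map_util_freq() is essentially util * freq / capacity; a worked sketch with made-up numbers:

    #include <stdio.h>

    /* Roughly what map_util_freq() does: scale util (0..scale_cpu) onto freq. */
    static unsigned long util_to_freq(unsigned long util, unsigned long freq,
                                      unsigned long scale_cpu)
    {
        return (unsigned long)((unsigned long long)freq * util / scale_cpu);
    }

    int main(void)
    {
        unsigned long scale_cpu = 1024;     /* arch_scale_cpu_capacity() */
        unsigned long ref_freq  = 2400000;  /* arch_scale_freq_ref(), kHz, hypothetical */
        unsigned long util      = 640;      /* already headroom-adjusted */

        /* 640/1024 of 2.4 GHz gives a 1.5 GHz request, later rounded up to an EM state. */
        printf("freq = %lu kHz\n", util_to_freq(util, ref_freq, scale_cpu));
        return 0;
    }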
49 changes: 12 additions & 37 deletions include/linux/sched.h
@@ -416,42 +416,6 @@ struct load_weight {
u32 inv_weight;
};

/**
* struct util_est - Estimation utilization of FAIR tasks
* @enqueued: instantaneous estimated utilization of a task/cpu
* @ewma: the Exponential Weighted Moving Average (EWMA)
* utilization of a task
*
* Support data structure to track an Exponential Weighted Moving Average
* (EWMA) of a FAIR task's utilization. New samples are added to the moving
* average each time a task completes an activation. Sample's weight is chosen
* so that the EWMA will be relatively insensitive to transient changes to the
* task's workload.
*
* The enqueued attribute has a slightly different meaning for tasks and cpus:
* - task: the task's util_avg at last task dequeue time
* - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
* Thus, the util_est.enqueued of a task represents the contribution on the
* estimated utilization of the CPU where that task is currently enqueued.
*
* Only for tasks we track a moving average of the past instantaneous
* estimated utilization. This allows to absorb sporadic drops in utilization
* of an otherwise almost periodic task.
*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est.enqueued at dequeue
* time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
* for a task) it is safe to use MSB.
*/
struct util_est {
unsigned int enqueued;
unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
} __attribute__((__aligned__(sizeof(u64))));

/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
@@ -506,11 +470,22 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
struct util_est util_est;
unsigned int util_est;
DEEPIN_KABI_RESERVE(1)
DEEPIN_KABI_RESERVE(2)
} ____cacheline_aligned;

/*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est at dequeue time.
* Since max value of util_est for a task is 1024 (PELT util_avg for a task)
* it is safe to use MSB.
*/
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000

struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
8 changes: 8 additions & 0 deletions include/linux/sched/topology.h
@@ -275,6 +275,14 @@ void arch_update_thermal_pressure(const struct cpumask *cpus,
{ }
#endif

#ifndef arch_scale_freq_ref
static __always_inline
unsigned int arch_scale_freq_ref(int cpu)
{
return 0;
}
#endif

static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
2 changes: 1 addition & 1 deletion kernel/sched/core.c
@@ -10047,7 +10047,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
2 changes: 1 addition & 1 deletion kernel/sched/cpudeadline.c
@@ -131,7 +131,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);

cap = capacity_orig_of(cpu);
cap = arch_scale_cpu_capacity(cpu);

if (cap > max_cap ||
(cpu == task_cpu(p) && cap == max_cap)) {