
Commit 4c5744a

Merge branches 'pm-cpuidle' and 'pm-em'
* pm-cpuidle:
  cpuidle: Select polling interval based on a c-state with a longer target residency
  cpuidle: psci: Enable suspend-to-idle for PSCI OSI mode
  PM: domains: Enable dev_pm_genpd_suspend|resume() for suspend-to-idle
  PM: domains: Rename pm_genpd_syscore_poweroff|poweron()

* pm-em:
  PM / EM: Micro optimization in em_cpu_energy
  PM: EM: Update Energy Model with new flag indicating power scale
  PM: EM: update the comments related to power scale
  PM: EM: Clarify abstract scale usage for power values in Energy Model
3 parents e1f1320 + 7a25759 + 1080399 commit 4c5744a

13 files changed, 154 insertions(+), 49 deletions(-)


Documentation/driver-api/thermal/power_allocator.rst

Lines changed: 11 additions & 1 deletion
@@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore
 simply an estimate, and may be tuned to affect the aggressiveness of
 the thermal ramp. For reference, the sustainable power of a 4" phone
 is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
-depending on screen size).
+depending on screen size). It is possible to have the power value
+expressed in an abstract scale. The sustained power should be aligned
+to the scale used by the related cooling devices.
 
 If you are using device tree, do add it as a property of the
 thermal-zone. For example::
@@ -269,3 +271,11 @@ won't be very good. Note that this is not particular to this
 governor, step-wise will also misbehave if you call its throttle()
 faster than the normal thermal framework tick (due to interrupts for
 example) as it will overreact.
+
+Energy Model requirements
+=========================
+
+Another important thing is the consistent scale of the power values
+provided by the cooling devices. All of the cooling devices in a single
+thermal zone should have power values reported either in milli-Watts
+or scaled to the same 'abstract scale'.

Documentation/power/energy-model.rst

Lines changed: 25 additions & 5 deletions
@@ -20,6 +20,21 @@ possible source of information on its own, the EM framework intervenes as an
 abstraction layer which standardizes the format of power cost tables in the
 kernel, hence enabling to avoid redundant work.
 
+The power values might be expressed in milli-Watts or in an 'abstract scale'.
+Multiple subsystems might use the EM and it is up to the system integrator to
+check that the requirements for the power value scale types are met. An example
+can be found in the Energy-Aware Scheduler documentation
+Documentation/scheduler/sched-energy.rst. For some subsystems like thermal or
+powercap power values expressed in an 'abstract scale' might cause issues.
+These subsystems are more interested in estimation of power used in the past,
+thus the real milli-Watts might be needed. An example of these requirements can
+be found in the Intelligent Power Allocation in
+Documentation/driver-api/thermal/power_allocator.rst.
+Kernel subsystems might implement automatic detection to check whether EM
+registered devices have inconsistent scale (based on EM internal flag).
+Important thing to keep in mind is that when the power values are expressed in
+an 'abstract scale' deriving real energy in milli-Joules would not be possible.
+
 The figure below depicts an example of drivers (Arm-specific here, but the
 approach is applicable to any architecture) providing power costs to the EM
 framework, and interested clients reading the data from it::
@@ -73,14 +88,18 @@ Drivers are expected to register performance domains into the EM framework by
 calling the following API::
 
   int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
-		struct em_data_callback *cb, cpumask_t *cpus);
+		struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);
 
 Drivers must provide a callback function returning <frequency, power> tuples
 for each performance state. The callback function provided by the driver is free
 to fetch data from any relevant location (DT, firmware, ...), and by any mean
 deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
 performance domains using cpumask. For other devices than CPUs the last
 argument must be set to NULL.
+The last argument 'milliwatts' is important to set with correct value. Kernel
+subsystems which use EM might rely on this flag to check if all EM devices use
+the same scale. If there are different scales, these subsystems might decide
+to: return warning/error, stop working or panic.
 See Section 3. for an example of driver implementing this
 callback, and kernel/power/energy_model.c for further documentation on this
 API.
@@ -156,7 +175,8 @@ EM framework::
  37	nr_opp = foo_get_nr_opp(policy);
  38
  39	/* And register the new performance domain */
- 40	em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
- 41
- 42	return 0;
- 43 }
+ 40	em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
+ 41				    true);
+ 42
+ 43	return 0;
+ 44 }
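The updated example passes 'true' because the documentation's hypothetical foo CPUFreq driver reports real milli-Watts. For contrast, a device whose power values use an 'abstract scale' would pass 'false' as the new last argument; a minimal sketch, not part of this commit (the em_cb callback and the foo_get_nr_states() helper are hypothetical placeholders):

        /* Hypothetical sketch: register an EM for a non-CPU device whose
         * power values are in an 'abstract scale', hence milliwatts = false.
         */
        static int foo_dev_register_em(struct device *dev)
        {
                int nr_states = foo_get_nr_states(dev); /* placeholder helper */

                /* Non-CPU device, so the cpumask argument is NULL. */
                return em_dev_register_perf_domain(dev, nr_states, &em_cb,
                                                   NULL, false);
        }

Subsystems that consume the EM can then inspect this flag to detect devices registered with mixed scales, as the documentation text above describes.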

Documentation/scheduler/sched-energy.rst

Lines changed: 5 additions & 0 deletions
@@ -350,6 +350,11 @@ independent EM framework in Documentation/power/energy-model.rst.
 Please also note that the scheduling domains need to be re-built after the
 EM has been registered in order to start EAS.
 
+EAS uses the EM to make a forecasting decision on energy usage and thus it is
+more focused on the difference when checking possible options for task
+placement. For EAS it doesn't matter whether the EM power values are expressed
+in milli-Watts or in an 'abstract scale'.
+
 
 6.3 - Energy Model complexity
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

drivers/base/power/domain.c

Lines changed: 35 additions & 16 deletions
@@ -1363,41 +1363,60 @@ static void genpd_complete(struct device *dev)
 	genpd_unlock(genpd);
 }
 
-/**
- * genpd_syscore_switch - Switch power during system core suspend or resume.
- * @dev: Device that normally is marked as "always on" to switch power for.
- *
- * This routine may only be called during the system core (syscore) suspend or
- * resume phase for devices whose "always on" flags are set.
- */
-static void genpd_syscore_switch(struct device *dev, bool suspend)
+static void genpd_switch_state(struct device *dev, bool suspend)
 {
 	struct generic_pm_domain *genpd;
+	bool use_lock;
 
 	genpd = dev_to_genpd_safe(dev);
 	if (!genpd)
 		return;
 
+	use_lock = genpd_is_irq_safe(genpd);
+
+	if (use_lock)
+		genpd_lock(genpd);
+
 	if (suspend) {
 		genpd->suspended_count++;
-		genpd_sync_power_off(genpd, false, 0);
+		genpd_sync_power_off(genpd, use_lock, 0);
 	} else {
-		genpd_sync_power_on(genpd, false, 0);
+		genpd_sync_power_on(genpd, use_lock, 0);
 		genpd->suspended_count--;
 	}
+
+	if (use_lock)
+		genpd_unlock(genpd);
 }
 
-void pm_genpd_syscore_poweroff(struct device *dev)
+/**
+ * dev_pm_genpd_suspend - Synchronously try to suspend the genpd for @dev
+ * @dev: The device that is attached to the genpd, that can be suspended.
+ *
+ * This routine should typically be called for a device that needs to be
+ * suspended during the syscore suspend phase. It may also be called during
+ * suspend-to-idle to suspend a corresponding CPU device that is attached to a
+ * genpd.
+ */
+void dev_pm_genpd_suspend(struct device *dev)
 {
-	genpd_syscore_switch(dev, true);
+	genpd_switch_state(dev, true);
 }
-EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweroff);
+EXPORT_SYMBOL_GPL(dev_pm_genpd_suspend);
 
-void pm_genpd_syscore_poweron(struct device *dev)
+/**
+ * dev_pm_genpd_resume - Synchronously try to resume the genpd for @dev
+ * @dev: The device that is attached to the genpd, which needs to be resumed.
+ *
+ * This routine should typically be called for a device that needs to be resumed
+ * during the syscore resume phase. It may also be called during suspend-to-idle
+ * to resume a corresponding CPU device that is attached to a genpd.
+ */
+void dev_pm_genpd_resume(struct device *dev)
 {
-	genpd_syscore_switch(dev, false);
+	genpd_switch_state(dev, false);
 }
-EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweron);
+EXPORT_SYMBOL_GPL(dev_pm_genpd_resume);
 
 #else /* !CONFIG_PM_SLEEP */
 
drivers/clocksource/sh_cmt.c

Lines changed: 4 additions & 4 deletions
@@ -658,7 +658,7 @@ static void sh_cmt_clocksource_suspend(struct clocksource *cs)
 		return;
 
 	sh_cmt_stop(ch, FLAG_CLOCKSOURCE);
-	pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
+	dev_pm_genpd_suspend(&ch->cmt->pdev->dev);
 }
 
 static void sh_cmt_clocksource_resume(struct clocksource *cs)
@@ -668,7 +668,7 @@ static void sh_cmt_clocksource_resume(struct clocksource *cs)
 	if (!ch->cs_enabled)
 		return;
 
-	pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
+	dev_pm_genpd_resume(&ch->cmt->pdev->dev);
 	sh_cmt_start(ch, FLAG_CLOCKSOURCE);
 }
 
@@ -760,7 +760,7 @@ static void sh_cmt_clock_event_suspend(struct clock_event_device *ced)
 {
 	struct sh_cmt_channel *ch = ced_to_sh_cmt(ced);
 
-	pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
+	dev_pm_genpd_suspend(&ch->cmt->pdev->dev);
 	clk_unprepare(ch->cmt->clk);
 }
 
@@ -769,7 +769,7 @@ static void sh_cmt_clock_event_resume(struct clock_event_device *ced)
 	struct sh_cmt_channel *ch = ced_to_sh_cmt(ced);
 
 	clk_prepare(ch->cmt->clk);
-	pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
+	dev_pm_genpd_resume(&ch->cmt->pdev->dev);
 }
 
 static int sh_cmt_register_clockevent(struct sh_cmt_channel *ch,

drivers/clocksource/sh_mtu2.c

Lines changed: 2 additions & 2 deletions
@@ -297,12 +297,12 @@ static int sh_mtu2_clock_event_set_periodic(struct clock_event_device *ced)
 
 static void sh_mtu2_clock_event_suspend(struct clock_event_device *ced)
 {
-	pm_genpd_syscore_poweroff(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
+	dev_pm_genpd_suspend(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
 }
 
 static void sh_mtu2_clock_event_resume(struct clock_event_device *ced)
 {
-	pm_genpd_syscore_poweron(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
+	dev_pm_genpd_resume(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
 }
 
 static void sh_mtu2_register_clockevent(struct sh_mtu2_channel *ch,

drivers/clocksource/sh_tmu.c

Lines changed: 4 additions & 4 deletions
@@ -292,7 +292,7 @@ static void sh_tmu_clocksource_suspend(struct clocksource *cs)
 
 	if (--ch->enable_count == 0) {
 		__sh_tmu_disable(ch);
-		pm_genpd_syscore_poweroff(&ch->tmu->pdev->dev);
+		dev_pm_genpd_suspend(&ch->tmu->pdev->dev);
 	}
 }
 
@@ -304,7 +304,7 @@ static void sh_tmu_clocksource_resume(struct clocksource *cs)
 		return;
 
 	if (ch->enable_count++ == 0) {
-		pm_genpd_syscore_poweron(&ch->tmu->pdev->dev);
+		dev_pm_genpd_resume(&ch->tmu->pdev->dev);
 		__sh_tmu_enable(ch);
 	}
 }
@@ -394,12 +394,12 @@ static int sh_tmu_clock_event_next(unsigned long delta,
 
 static void sh_tmu_clock_event_suspend(struct clock_event_device *ced)
 {
-	pm_genpd_syscore_poweroff(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
+	dev_pm_genpd_suspend(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
 }
 
 static void sh_tmu_clock_event_resume(struct clock_event_device *ced)
 {
-	pm_genpd_syscore_poweron(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
+	dev_pm_genpd_resume(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
 }
 
 static void sh_tmu_register_clockevent(struct sh_tmu_channel *ch,

drivers/cpuidle/cpuidle-psci-domain.c

Lines changed: 2 additions & 0 deletions
@@ -327,6 +327,8 @@ struct device *psci_dt_attach_cpu(int cpu)
 	if (cpu_online(cpu))
 		pm_runtime_get_sync(dev);
 
+	dev_pm_syscore_device(dev, true);
+
 	return dev;
 }

drivers/cpuidle/cpuidle-psci.c

Lines changed: 30 additions & 4 deletions
@@ -19,6 +19,7 @@
 #include <linux/of_device.h>
 #include <linux/platform_device.h>
 #include <linux/psci.h>
+#include <linux/pm_domain.h>
 #include <linux/pm_runtime.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -52,8 +53,9 @@ static inline int psci_enter_state(int idx, u32 state)
 	return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state);
 }
 
-static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
-					struct cpuidle_driver *drv, int idx)
+static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
+					  struct cpuidle_driver *drv, int idx,
+					  bool s2idle)
 {
 	struct psci_cpuidle_data *data = this_cpu_ptr(&psci_cpuidle_data);
 	u32 *states = data->psci_states;
@@ -66,15 +68,25 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
 		return -1;
 
 	/* Do runtime PM to manage a hierarchical CPU toplogy. */
-	RCU_NONIDLE(pm_runtime_put_sync_suspend(pd_dev));
+	rcu_irq_enter_irqson();
+	if (s2idle)
+		dev_pm_genpd_suspend(pd_dev);
+	else
+		pm_runtime_put_sync_suspend(pd_dev);
+	rcu_irq_exit_irqson();
 
 	state = psci_get_domain_state();
 	if (!state)
 		state = states[idx];
 
 	ret = psci_cpu_suspend_enter(state) ? -1 : idx;
 
-	RCU_NONIDLE(pm_runtime_get_sync(pd_dev));
+	rcu_irq_enter_irqson();
+	if (s2idle)
+		dev_pm_genpd_resume(pd_dev);
+	else
+		pm_runtime_get_sync(pd_dev);
+	rcu_irq_exit_irqson();
 
 	cpu_pm_exit();
 
@@ -83,6 +95,19 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
 	return ret;
 }
 
+static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
+					struct cpuidle_driver *drv, int idx)
+{
+	return __psci_enter_domain_idle_state(dev, drv, idx, false);
+}
+
+static int psci_enter_s2idle_domain_idle_state(struct cpuidle_device *dev,
+					       struct cpuidle_driver *drv,
+					       int idx)
+{
+	return __psci_enter_domain_idle_state(dev, drv, idx, true);
+}
+
 static int psci_idle_cpuhp_up(unsigned int cpu)
 {
 	struct device *pd_dev = __this_cpu_read(psci_cpuidle_data.dev);
@@ -170,6 +195,7 @@ static int psci_dt_cpu_init_topology(struct cpuidle_driver *drv,
 	 * deeper states.
 	 */
 	drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
+	drv->states[state_count - 1].enter_s2idle = psci_enter_s2idle_domain_idle_state;
 	psci_cpuidle_use_cpuhp = true;
 
 	return 0;

drivers/cpuidle/cpuidle.c

Lines changed: 23 additions & 2 deletions
@@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+/*
+ * Min polling interval of 10usec is a guess. It is assuming that
+ * for most users, the time for a single ping-pong workload like
+ * perf bench pipe would generally complete within 10usec but
+ * this is hardware dependant. Actual time can be estimated with
+ *
+ * perf bench sched pipe -l 10000
+ *
+ * Run multiple times to avoid cpufreq effects.
+ */
+#define CPUIDLE_POLL_MIN 10000
+#define CPUIDLE_POLL_MAX (TICK_NSEC / 16)
+
 /**
  * cpuidle_poll_time - return amount of time to poll for,
  * governors can override dev->poll_limit_ns if necessary
@@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
 	int i;
 	u64 limit_ns;
 
+	BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX);
+
 	if (dev->poll_limit_ns)
 		return dev->poll_limit_ns;
 
-	limit_ns = TICK_NSEC;
+	limit_ns = CPUIDLE_POLL_MAX;
 	for (i = 1; i < drv->state_count; i++) {
+		u64 state_limit;
+
 		if (dev->states_usage[i].disable)
 			continue;
 
-		limit_ns = drv->states[i].target_residency_ns;
+		state_limit = drv->states[i].target_residency_ns;
+		if (state_limit < CPUIDLE_POLL_MIN)
+			continue;
+
+		limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX);
 		break;
 	}
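The new selection rule is easiest to see in isolation: skip enabled states whose target residency is below CPUIDLE_POLL_MIN, then cap the chosen residency at CPUIDLE_POLL_MAX. A standalone userspace sketch of just that rule (not kernel code; the per-state disable check is omitted and the 4 ms tick is an assumed value):

        #include <stdio.h>
        #include <stdint.h>

        #define CPUIDLE_POLL_MIN 10000ULL               /* 10 usec in ns */
        #define CPUIDLE_POLL_MAX (4000000ULL / 16)      /* assumes a 4 ms tick */

        /* Pick the first state (beyond the polling state 0) whose target
         * residency is at least CPUIDLE_POLL_MIN, capped at CPUIDLE_POLL_MAX.
         */
        static uint64_t poll_limit(const uint64_t *residency_ns, int count)
        {
                uint64_t limit_ns = CPUIDLE_POLL_MAX;
                int i;

                for (i = 1; i < count; i++) {
                        if (residency_ns[i] < CPUIDLE_POLL_MIN)
                                continue;
                        limit_ns = residency_ns[i] < CPUIDLE_POLL_MAX ?
                                   residency_ns[i] : CPUIDLE_POLL_MAX;
                        break;
                }
                return limit_ns;
        }

        int main(void)
        {
                /* ns; state 0 is the polling state itself */
                uint64_t states[] = { 0, 2000, 50000, 600000 };

                /* State 1 (2 usec) is too short and is skipped; state 2
                 * (50 usec) wins, so this prints 50000 ns.
                 */
                printf("poll limit: %llu ns\n",
                       (unsigned long long)poll_limit(states, 4));
                return 0;
        }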
