
Commit 0408497

Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "The most significant change here is the extension of the Energy Model
  to cover non-CPU devices (as well as CPUs) from Lukasz Luba.

  There is also some new hardware support (Ice Lake server idle states
  table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
  RAPL driver), some new functionality in the existing drivers (eg. a
  new switch to disable/enable CPU energy-efficiency optimizations in
  intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
  core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
  devfreq), including the elimination of W=1 build warnings from cpufreq
  done by Lee Jones.

  Specifics:

   - Make the Energy Model cover non-CPU devices (Lukasz Luba).

   - Add Ice Lake server idle states table to the intel_idle driver and
     eliminate a redundant static variable from it (Chen Yu, Rafael
     Wysocki).

   - Eliminate all W=1 build warnings from cpufreq (Lee Jones).

   - Add support for Sapphire Rapids and for Power Limit 4 to the Intel
     RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).

   - Fix function name in kerneldoc comments in the idle_inject power
     capping driver (Yangtao Li).

   - Fix locking issues with cpufreq governors and drop a redundant
     "weak" function definition from cpufreq (Viresh Kumar).

   - Rearrange cpufreq to register non-modular governors at the
     core_initcall level and allow the default cpufreq governor to be
     specified in the kernel command line (Quentin Perret).

   - Extend, fix and clean up the intel_pstate driver (Srinivas
     Pandruvada, Rafael Wysocki):

       * Add a new sysfs attribute for disabling/enabling CPU
         energy-efficiency optimizations in the processor.

       * Make the driver avoid enabling HWP if EPP is not supported.

       * Allow the driver to handle numeric EPP values in the sysfs
         interface and fix the setting of EPP via sysfs in the active
         mode.

       * Eliminate a static checker warning and clean up a kerneldoc
         comment.

   - Clean up some variable declarations in the powernv cpufreq driver
     (Wei Yongjun).

   - Fix up the ->enter_s2idle callback definition to cover the case
     when it points to the same function as ->idle correctly (Neal Liu).

   - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).

   - Make the PM core emit "changed" uevent when adding/removing the
     "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).

   - Add a helper macro for declaring PM callbacks and use it in the MMC
     jz4740 driver (Paul Cercueil).

   - Fix white space in some places in the hibernate code and make the
     system-wide PM code use "const char *" where appropriate (Xiang
     Chen, Alexey Dobriyan).

   - Add one more "unsafe" helper macro to the freezer to cover the NFS
     use case (He Zhe).

   - Change the language in the generic PM domains framework to use
     parent/child terminology and clean up a typo and some comment
     formatting in that code (Kees Cook, Geert Uytterhoeven).

   - Update the operating performance points OPP framework (Lukasz Luba,
     Andrew-sh.Cheng, Valdis Kletnieks):

       * Refactor dev_pm_opp_of_register_em() and update related
         drivers.

       * Add a missing function export.

       * Allow disabled OPPs in dev_pm_opp_get_freq().

   - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
     Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):

       * Add support for delayed timers to the devfreq core and make the
         Samsung exynos5422-dmc driver use it.

       * Unify sysfs interface to use "df-" as a prefix in instance
         names consistently.

       * Fix devfreq_summary debugfs node indentation.

       * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
         bindings.

       * List Dmitry Osipenko as the Tegra devfreq driver maintainer.

       * Fix typos in the core devfreq code.

   - Update the pm-graph utility to version 5.7 including a number of
     fixes related to suspend-to-idle (Todd Brandt).

   - Fix coccicheck errors and warnings in the cpupower utility (Shuah
     Khan).

   - Replace HTTP links with HTTPs ones in multiple places (Alexander A.
     Klimov)"

* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
  cpuidle: ACPI: fix 'return' with no value build warning
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  intel_idle: Customize IceLake server support
  PM / devfreq: Fix the wrong end with semicolon
  PM / devfreq: Fix indentaion of devfreq_summary debugfs node
  PM / devfreq: Clean up the devfreq instance name in sysfs attr
  memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
  memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
  memory: samsung: exynos5422-dmc: Use delayed timer as default
  PM / devfreq: Add support delayed timer for polling mode
  dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
  PM / devfreq: tegra: Add Dmitry as a maintainer
  PM / devfreq: event: Fix trivial spelling
  PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
  cpuidle: change enter_s2idle() prototype
  cpuidle: psci: Prevent domain idlestates until consumers are ready
  cpuidle: psci: Convert PM domain to platform driver
  cpuidle: psci: Fix error path via converting to a platform driver
  cpuidle: psci: Fail cpuidle registration if set OSI mode failed
  ...
2 parents d516840 + 86ba54f commit 0408497


81 files changed: 1596 additions & 962 deletions

Documentation/ABI/testing/sysfs-class-devfreq

Lines changed: 12 additions & 0 deletions
@@ -108,3 +108,15 @@ Description:
 		frequency requested by governors and min_freq.
 		The max_freq overrides min_freq because max_freq may be
 		used to throttle devices to avoid overheating.
+
+What:		/sys/class/devfreq/.../timer
+Date:		July 2020
+Contact:	Chanwoo Choi <[email protected]>
+Description:
+		This ABI shows and stores the kind of work timer by users.
+		This work timer is used by devfreq workqueue in order to
+		monitor the device status such as utilization. The user
+		can change the work timer on runtime according to their demand
+		as following:
+			echo deferrable > /sys/class/devfreq/.../timer
+			echo delayed > /sys/class/devfreq/.../timer
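
The two `echo` commands in the ABI entry above can also be issued from a program. The following is a minimal userspace sketch, not part of this commit: the helper names `devfreq_timer_mode_valid()` and `devfreq_set_timer()` are hypothetical, and the caller is assumed to pass the full path of the `timer` attribute, since the devfreq instance name is elided in the documentation.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if mode is one of the two timer kinds named in the ABI entry. */
int devfreq_timer_mode_valid(const char *mode)
{
	return strcmp(mode, "deferrable") == 0 || strcmp(mode, "delayed") == 0;
}

/*
 * Write mode to the timer attribute at attr_path (hypothetical helper).
 * Returns 0 on success, -1 on an invalid mode or an I/O error.
 */
int devfreq_set_timer(const char *attr_path, const char *mode)
{
	FILE *f;
	int ok;

	/* Reject anything other than the two documented values up front. */
	if (!devfreq_timer_mode_valid(mode))
		return -1;

	f = fopen(attr_path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%s\n", mode) > 0;
	/* sysfs reports write errors at flush time, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Validating the string before opening the file keeps a typo from reaching the kernel at all; whether the kernel itself rejects unknown values is not stated in this hunk.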

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 5 additions & 0 deletions
@@ -703,6 +703,11 @@
 	cpufreq.off=1	[CPU_FREQ]
 			disable the cpufreq sub-system

+	cpufreq.default_governor=
+			[CPU_FREQ] Name of the default cpufreq governor or
+			policy to use. This governor must be registered in the
+			kernel before the cpufreq driver probes.
+
 	cpu_init_udelay=N
 			[X86] Delay for N microsec between assert and de-assert
 			of APIC INIT to start processors. This delay occurs
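
To check which governor was requested on a running system, a tool could look for the new parameter in a command-line string such as the contents of `/proc/cmdline`. The sketch below is an illustration only: `cmdline_get_param()` is a hypothetical helper, it splits on whitespace and does not handle quoted values, and it is not how the kernel itself parses the parameter.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of "key=value" from a whitespace-separated command line
 * into out. Returns 0 if the key was found and its value fit, -1 otherwise.
 */
int cmdline_get_param(const char *cmdline, const char *key,
		      char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = cmdline;

	while (*p) {
		size_t toklen;

		/* Skip separators, then measure the next token. */
		p += strspn(p, " \t\n");
		toklen = strcspn(p, " \t\n");

		/* The token must start with the key followed by '='. */
		if (toklen > klen && strncmp(p, key, klen) == 0 &&
		    p[klen] == '=') {
			size_t vlen = toklen - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p += toklen;
	}
	return -1;
}
```

For example, calling it with the key `cpufreq.default_governor` on the string `quiet cpufreq.default_governor=schedutil ro` fills the output buffer with `schedutil`.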

Documentation/admin-guide/pm/cpufreq.rst

Lines changed: 3 additions & 3 deletions
@@ -147,9 +147,9 @@ CPUs in it.

 The next major initialization step for a new policy object is to attach a
 scaling governor to it (to begin with, that is the default scaling governor
-determined by the kernel configuration, but it may be changed later
-via ``sysfs``). First, a pointer to the new policy object is passed to the
-governor's ``->init()`` callback which is expected to initialize all of the
+determined by the kernel command line or configuration, but it may be changed
+later via ``sysfs``). First, a pointer to the new policy object is passed to
+the governor's ``->init()`` callback which is expected to initialize all of the
 data structures necessary to handle the given policy and, possibly, to add
 a governor ``sysfs`` interface to it. Next, the governor is started by
 invoking its ``->start()`` callback.

Documentation/admin-guide/pm/intel_pstate.rst

Lines changed: 16 additions & 1 deletion
@@ -431,6 +431,17 @@ argument is passed to the kernel in the command line.
 	supported in the current configuration, writes to this attribute will
 	fail with an appropriate error.

+``energy_efficiency``
+	This attribute is only present on platforms, which have CPUs matching
+	Kaby Lake or Coffee Lake desktop CPU model. By default
+	energy efficiency optimizations are disabled on these CPU models in HWP
+	mode by this driver. Enabling energy efficiency may limit maximum
+	operating frequency in both HWP and non HWP mode. In non HWP mode,
+	optimizations are done only in the turbo frequency range. In HWP mode,
+	optimizations are done in the entire frequency range. Setting this
+	attribute to "1" enables energy efficiency optimizations and setting
+	to "0" disables energy efficiency optimizations.
+
 Interpretation of Policy Attributes
 -----------------------------------

@@ -554,7 +565,11 @@ somewhere between the two extremes:
 Strings written to the ``energy_performance_preference`` attribute are
 internally translated to integer values written to the processor's
 Energy-Performance Preference (EPP) knob (if supported) or its
-Energy-Performance Bias (EPB) knob.
+Energy-Performance Bias (EPB) knob. It is also possible to write a positive
+integer value between 0 to 255, if the EPP feature is present. If the EPP
+feature is not present, writing integer value to this attribute is not
+supported. In this case, user can use
+"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.

 [Note that tasks may by migrated from one CPU to another by the scheduler's
 load-balancing algorithm and if different energy vs performance hints are
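
The numeric form of `energy_performance_preference` described in this hunk is limited to the range 0 to 255. A caller may want to validate a value before writing it. The sketch below shows one way to do that in userspace; `parse_raw_epp()` is a hypothetical name, and the driver's own parsing may accept or reject inputs differently.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a raw EPP value: a decimal integer in the range 0..255.
 * Returns the value, or -1 if the string is not a number in that range
 * (for example, when it is one of the named preference strings).
 */
int parse_raw_epp(const char *s)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno != 0 || end == s)
		return -1;

	/* Allow one trailing newline, as produced by "echo". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;

	if (v < 0 || v > 255)
		return -1;

	return (int)v;
}
```

A return value of -1 for non-numeric input lets the caller fall back to treating the string as a named preference.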

Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ Optional properties:
 			 format depends on the interrupt controller.
 			 It should be a DCF interrupt. When DDR DVFS finishes
 			 a DCF interrupt is triggered.
+- rockchip,pmu:		 Phandle to the syscon managing the "PMU general register
+			 files".

 Following properties relate to DDR timing:


Documentation/power/energy-model.rst

Lines changed: 75 additions & 60 deletions
@@ -1,15 +1,17 @@
-====================
-Energy Model of CPUs
-====================
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Energy Model of devices
+=======================

 1. Overview
 -----------

 The Energy Model (EM) framework serves as an interface between drivers knowing
-the power consumed by CPUs at various performance levels, and the kernel
+the power consumed by devices at various performance levels, and the kernel
 subsystems willing to use that information to make energy-aware decisions.

-The source of the information about the power consumed by CPUs can vary greatly
+The source of the information about the power consumed by devices can vary greatly
 from one platform to another. These power costs can be estimated using
 devicetree data in some cases. In others, the firmware will know better.
 Alternatively, userspace might be best positioned. And so on. In order to avoid
@@ -25,7 +27,7 @@ framework, and interested clients reading the data from it::
        +---------------+  +-----------------+  +---------------+
        | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
        +---------------+  +-----------------+  +---------------+
-               |                   | em_pd_energy()    |
+               |                   | em_cpu_energy()   |
                |                   | em_cpu_get()      |
                +---------+         |         +---------+
                          |         |         |
@@ -35,7 +37,7 @@ framework, and interested clients reading the data from it::
                         |     Framework       |
                         +---------------------+
                            ^       ^       ^
-                           |       |       | em_register_perf_domain()
+                           |       |       | em_dev_register_perf_domain()
                 +----------+       |       +---------+
                 |                  |                 |
         +---------------+  +---------------+  +--------------+
@@ -47,12 +49,12 @@ framework, and interested clients reading the data from it::
         | Device Tree  |  |   Firmware    |  |      ?       |
         +--------------+  +---------------+  +--------------+

-The EM framework manages power cost tables per 'performance domain' in the
-system. A performance domain is a group of CPUs whose performance is scaled
-together. Performance domains generally have a 1-to-1 mapping with CPUFreq
-policies. All CPUs in a performance domain are required to have the same
-micro-architecture. CPUs in different performance domains can have different
-micro-architectures.
+In case of CPU devices the EM framework manages power cost tables per
+'performance domain' in the system. A performance domain is a group of CPUs
+whose performance is scaled together. Performance domains generally have a
+1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are
+required to have the same micro-architecture. CPUs in different performance
+domains can have different micro-architectures.


 2. Core APIs
@@ -70,28 +72,37 @@ CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
 Drivers are expected to register performance domains into the EM framework by
 calling the following API::

-  int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
-			      struct em_data_callback *cb);
+  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
+		struct em_data_callback *cb, cpumask_t *cpus);

-Drivers must specify the CPUs of the performance domains using the cpumask
-argument, and provide a callback function returning <frequency, power> tuples
-for each capacity state. The callback function provided by the driver is free
+Drivers must provide a callback function returning <frequency, power> tuples
+for each performance state. The callback function provided by the driver is free
 to fetch data from any relevant location (DT, firmware, ...), and by any mean
-deemed necessary. See Section 3. for an example of driver implementing this
+deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
+performance domains using cpumask. For other devices than CPUs the last
+argument must be set to NULL.
+See Section 3. for an example of driver implementing this
 callback, and kernel/power/energy_model.c for further documentation on this
 API.


 2.3 Accessing performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+There are two API functions which provide the access to the energy model:
+em_cpu_get() which takes CPU id as an argument and em_pd_get() with device
+pointer as an argument. It depends on the subsystem which interface it is
+going to use, but in case of CPU devices both functions return the same
+performance domain.
+
 Subsystems interested in the energy model of a CPU can retrieve it using the
 em_cpu_get() API. The energy model tables are allocated once upon creation of
 the performance domains, and kept in memory untouched.

 The energy consumed by a performance domain can be estimated using the
-em_pd_energy() API. The estimation is performed assuming that the schedutil
-CPUfreq governor is in use.
+em_cpu_energy() API. The estimation is performed assuming that the schedutil
+CPUfreq governor is in use in case of CPU device. Currently this calculation is
+not provided for other type of devices.

 More details about the above APIs can be found in include/linux/energy_model.h.

@@ -106,42 +117,46 @@ EM framework::

   -> drivers/cpufreq/foo_cpufreq.c

-  01	static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
-  02	{
-  03		long freq, power;
-  04
-  05		/* Use the 'foo' protocol to ceil the frequency */
-  06		freq = foo_get_freq_ceil(cpu, *KHz);
-  07		if (freq < 0);
-  08			return freq;
-  09
-  10		/* Estimate the power cost for the CPU at the relevant freq. */
-  11		power = foo_estimate_power(cpu, freq);
-  12		if (power < 0);
-  13			return power;
-  14
-  15		/* Return the values to the EM framework */
-  16		*mW = power;
-  17		*KHz = freq;
-  18
-  19		return 0;
-  20	}
-  21
-  22	static int foo_cpufreq_init(struct cpufreq_policy *policy)
-  23	{
-  24		struct em_data_callback em_cb = EM_DATA_CB(est_power);
-  25		int nr_opp, ret;
-  26
-  27		/* Do the actual CPUFreq init work ... */
-  28		ret = do_foo_cpufreq_init(policy);
-  29		if (ret)
-  30			return ret;
-  31
-  32		/* Find the number of OPPs for this policy */
-  33		nr_opp = foo_get_nr_opp(policy);
-  34
-  35		/* And register the new performance domain */
-  36		em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
-  37
-  38		return 0;
-  39	}
+  01	static int est_power(unsigned long *mW, unsigned long *KHz,
+  02			struct device *dev)
+  03	{
+  04		long freq, power;
+  05
+  06		/* Use the 'foo' protocol to ceil the frequency */
+  07		freq = foo_get_freq_ceil(dev, *KHz);
+  08		if (freq < 0);
+  09			return freq;
+  10
+  11		/* Estimate the power cost for the dev at the relevant freq. */
+  12		power = foo_estimate_power(dev, freq);
+  13		if (power < 0);
+  14			return power;
+  15
+  16		/* Return the values to the EM framework */
+  17		*mW = power;
+  18		*KHz = freq;
+  19
+  20		return 0;
+  21	}
+  22
+  23	static int foo_cpufreq_init(struct cpufreq_policy *policy)
+  24	{
+  25		struct em_data_callback em_cb = EM_DATA_CB(est_power);
+  26		struct device *cpu_dev;
+  27		int nr_opp, ret;
+  28
+  29		cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
+  30
+  31		/* Do the actual CPUFreq init work ... */
+  32		ret = do_foo_cpufreq_init(policy);
+  33		if (ret)
+  34			return ret;
+  35
+  36		/* Find the number of OPPs for this policy */
+  37		nr_opp = foo_get_nr_opp(policy);
+  38
+  39		/* And register the new performance domain */
+  40		em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
+  41
+  42		return 0;
+  43	}
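
The `est_power()` example relies on two undefined `foo_*` helpers to "ceil" the frequency and estimate power. As a rough userspace model of that callback contract, the sketch below backs the lookup with a table of `<frequency, power>` tuples. Everything in it is an assumption made for illustration: `struct foo_perf_state`, `foo_est_power()`, the sorted-table layout and any numbers passed to it are not part of the kernel API or of this commit.

```c
#include <stddef.h>

/* One <frequency, power> tuple of a hypothetical device. */
struct foo_perf_state {
	unsigned long khz;
	unsigned long mw;
};

/*
 * Model of an est_power()-style callback backed by a table sorted by
 * ascending frequency. On entry *khz holds the requested frequency; on
 * success *khz and *mw are set to the first state at or above it and 0
 * is returned. Returns -1 if the request exceeds the highest state.
 */
int foo_est_power(const struct foo_perf_state *tbl, size_t n,
		  unsigned long *mw, unsigned long *khz)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].khz >= *khz) {
			/* Return the values to the caller, as in the example. */
			*mw = tbl[i].mw;
			*khz = tbl[i].khz;
			return 0;
		}
	}
	return -1;
}
```

With a three-entry table of 500000, 1000000 and 1500000 kHz, a request for 600000 kHz resolves to the 1000000 kHz entry, which mirrors the "ceil the frequency" step in the documented callback.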

Documentation/power/powercap/powercap.rst

Lines changed: 10 additions & 5 deletions
@@ -167,11 +167,13 @@ For example::
 package-0
 ---------

-The Intel RAPL technology allows two constraints, short term and long term,
-with two different time windows to be applied to each power zone. Thus for
-each zone there are 2 attributes representing the constraint names, 2 power
-limits and 2 attributes representing the sizes of the time windows. Such that,
-constraint_j_* attributes correspond to the jth constraint (j = 0,1).
+Depending on different power zones, the Intel RAPL technology allows
+one or multiple constraints like short term, long term and peak power,
+with different time windows to be applied to each power zone.
+All the zones contain attributes representing the constraint names,
+power limits and the sizes of the time windows. Note that time window
+is not applicable to peak power. Here, constraint_j_* attributes
+correspond to the jth constraint (j = 0,1,2).

 For example::

@@ -181,6 +183,9 @@ For example::
 	constraint_1_name
 	constraint_1_power_limit_uw
 	constraint_1_time_window_us
+	constraint_2_name
+	constraint_2_power_limit_uw
+	constraint_2_time_window_us

 Power Zone Attributes
 =====================
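
The `constraint_j_*` naming scheme above lends itself to generating attribute names rather than hard-coding them, which matters now that the number of constraints per zone can vary. A minimal sketch follows; `powercap_constraint_attr()` is a hypothetical helper and the zone directory it would be combined with is left to the caller.

```c
#include <stdio.h>

/*
 * Format "constraint_<j>_<suffix>" into buf, e.g. j = 2 and suffix
 * "power_limit_uw" give "constraint_2_power_limit_uw".
 * Returns the length of the name, or -1 if j is negative or the name
 * does not fit in buf.
 */
int powercap_constraint_attr(char *buf, size_t len, int j, const char *suffix)
{
	int n;

	if (j < 0)
		return -1;

	n = snprintf(buf, len, "constraint_%d_%s", j, suffix);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

A caller would typically loop over j and stop at the first index whose `constraint_j_name` file does not exist, since the documentation says zones may expose one or several constraints.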

MAINTAINERS

Lines changed: 9 additions & 0 deletions
@@ -11153,6 +11153,15 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl.git
1115311153
F: Documentation/devicetree/bindings/memory-controllers/
1115411154
F: drivers/memory/
1115511155

11156+
MEMORY FREQUENCY SCALING DRIVERS FOR NVIDIA TEGRA
11157+
M: Dmitry Osipenko <[email protected]>
11158+
11159+
11160+
T: git git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux.git
11161+
S: Maintained
11162+
F: drivers/devfreq/tegra20-devfreq.c
11163+
F: drivers/devfreq/tegra30-devfreq.c
11164+
1115611165
MEMORY MANAGEMENT
1115711166
M: Andrew Morton <[email protected]>
1115811167

arch/powerpc/platforms/cell/cpufreq_spudemand.c

Lines changed: 2 additions & 24 deletions
@@ -126,30 +126,8 @@ static struct cpufreq_governor spu_governor = {
 	.stop = spu_gov_stop,
 	.owner = THIS_MODULE,
 };
-
-/*
- * module init and destoy
- */
-
-static int __init spu_gov_init(void)
-{
-	int ret;
-
-	ret = cpufreq_register_governor(&spu_governor);
-	if (ret)
-		printk(KERN_ERR "registration of governor failed\n");
-	return ret;
-}
-
-static void __exit spu_gov_exit(void)
-{
-	cpufreq_unregister_governor(&spu_governor);
-}
-
-
-module_init(spu_gov_init);
-module_exit(spu_gov_exit);
+cpufreq_governor_init(spu_governor);
+cpufreq_governor_exit(spu_governor);

 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Christian Krafft <[email protected]>");
-

arch/x86/include/asm/msr-index.h

Lines changed: 4 additions & 2 deletions
@@ -149,6 +149,10 @@

 #define MSR_LBR_SELECT			0x000001c8
 #define MSR_LBR_TOS			0x000001c9
+
+#define MSR_IA32_POWER_CTL		0x000001fc
+#define MSR_IA32_POWER_CTL_BIT_EE	19
+
 #define MSR_LBR_NHM_FROM		0x00000680
 #define MSR_LBR_NHM_TO			0x000006c0
 #define MSR_LBR_CORE_FROM		0x00000040
@@ -269,8 +273,6 @@

 #define MSR_PEBS_FRONTEND		0x000003f7

-#define MSR_IA32_POWER_CTL		0x000001fc
-
 #define MSR_IA32_MC0_CTL		0x00000400
 #define MSR_IA32_MC0_STATUS		0x00000401
 #define MSR_IA32_MC0_ADDR		0x00000402
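
The new `MSR_IA32_POWER_CTL_BIT_EE` constant is a bit position, not a mask. The sketch below shows how such a position is typically turned into set, clear and test operations on a 64-bit register value. It only manipulates an in-memory value: the helper names are hypothetical, no MSR is read or written, and this hunk does not say whether a set bit enables or disables the optimization, so the sketch makes no claim about polarity.

```c
#include <stdint.h>

/* Bit position taken from the hunk above. */
#define MSR_IA32_POWER_CTL_BIT_EE 19

/* Return val with the EE bit set. */
uint64_t power_ctl_set_ee_bit(uint64_t val)
{
	return val | (UINT64_C(1) << MSR_IA32_POWER_CTL_BIT_EE);
}

/* Return val with the EE bit cleared. */
uint64_t power_ctl_clear_ee_bit(uint64_t val)
{
	return val & ~(UINT64_C(1) << MSR_IA32_POWER_CTL_BIT_EE);
}

/* Return 1 if the EE bit is set in val, 0 otherwise. */
int power_ctl_ee_bit(uint64_t val)
{
	return (int)((val >> MSR_IA32_POWER_CTL_BIT_EE) & 1);
}
```

Using `UINT64_C(1)` rather than a plain `1` keeps the shift in 64-bit arithmetic, which is the safe habit for MSR values even though bit 19 would also fit in an `int`.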
