Commit 6d277ac

Merge tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 "These add ACPI support to the intel_idle driver along with an admin
  guide document for it, add support for CPR (Core Power Reduction) to
  the AVS (Adaptive Voltage Scaling) subsystem, add new hardware support
  in a few places, add some new sysfs attributes, debugfs files and
  tracepoints, fix bugs and clean up a bunch of things all over.

  Specifics:

   - Update the ACPI processor driver in order to export
     acpi_processor_evaluate_cst() to the code outside of it, add ACPI
     support to the intel_idle driver based on that and clean up that
     driver somewhat (Rafael Wysocki).

   - Add an admin guide document for the intel_idle driver (Rafael
     Wysocki).

   - Clean up cpuidle core and drivers, enable compilation testing for
     some of them (Benjamin Gaignard, Krzysztof Kozlowski, Rafael
     Wysocki, Yangtao Li).

   - Fix reference counting of OPP (operating performance points) table
     structures (Viresh Kumar).

   - Add support for CPR (Core Power Reduction) to the AVS (Adaptive
     Voltage Scaling) subsystem (Niklas Cassel, Colin Ian King,
     YueHaibing).

   - Add support for TigerLake Mobile and JasperLake to the Intel RAPL
     power capping driver (Zhang Rui).

   - Update cpufreq drivers:

      - Add i.MX8MP support to imx-cpufreq-dt (Anson Huang).

      - Fix usage of a macro in loongson2_cpufreq (Alexandre Oliva).

      - Fix cpufreq policy reference counting issues in s3c and
        brcmstb-avs (chenqiwu).

      - Fix ACPI table reference counting issue and HiSilicon quirk
        handling in the CPPC driver (Hanjun Guo).

      - Clean up spelling mistake in intel_pstate (Harry Pan).

      - Convert the kirkwood and tegra186 drivers to using
        devm_platform_ioremap_resource() (Yangtao Li).

   - Update devfreq core:

      - Add 'name' sysfs attribute for devfreq devices (Chanwoo Choi).

      - Clean up the handling of transition statistics and allow them
        to be reset by writing 0 to the 'trans_stat' devfreq device
        attribute in sysfs (Kamil Konieczny).

      - Add 'devfreq_summary' to debugfs (Chanwoo Choi).

      - Clean up kerneldoc comments and Kconfig indentation (Krzysztof
        Kozlowski, Randy Dunlap).

   - Update devfreq drivers:

      - Add dynamic scaling for the imx8m DDR controller and clean up
        imx8m-ddrc (Leonard Crestez, YueHaibing).

      - Fix DT node reference counting and initialization error code
        path in rk3399_dmc and add COMPILE_TEST and HAVE_ARM_SMCCC
        dependency for it (Chanwoo Choi, Yangtao Li).

      - Fix DT node reference counting in rockchip-dfi and make it use
        devm_platform_ioremap_resource() (Yangtao Li).

      - Fix excessive stack usage in exynos-ppmu (Arnd Bergmann).

      - Fix initialization error code paths in exynos-bus (Yangtao Li).

      - Clean up exynos-bus and exynos somewhat (Artur Świgoń,
        Krzysztof Kozlowski).

   - Add tracepoints for tracking usage_count updates unrelated to
     status changes in PM-runtime (Michał Mirosław).

   - Add sysfs attribute to control the "sync on suspend" behavior
     during system-wide suspend (Jonas Meurer).

   - Switch system-wide suspend tests over to 64-bit time (Alexandre
     Belloni).

   - Make wakeup sources statistics in debugfs cover deleted ones which
     used to be the case some time ago (zhuguangqing).

   - Clean up computations carried out during hibernation, update
     messages related to hibernation and fix a spelling mistake in one
     of them (Wen Yang, Luigi Semenzato, Colin Ian King).

   - Add mailmap entry for maintainer e-mail address that has not been
     functional for several years (Rafael Wysocki)"

* tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
  cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
  intel_idle: Clean up irtl_2_usec()
  intel_idle: Move 3 functions closer to their callers
  intel_idle: Annotate initialization code and data structures
  intel_idle: Move and clean up intel_idle_cpuidle_devices_uninit()
  intel_idle: Rearrange intel_idle_cpuidle_driver_init()
  intel_idle: Clean up NULL pointer check in intel_idle_init()
  intel_idle: Fold intel_idle_probe() into intel_idle_init()
  intel_idle: Eliminate __setup_broadcast_timer()
  cpuidle: fix cpuidle_find_deepest_state() kerneldoc warnings
  cpuidle: sysfs: fix warnings when compiling with W=1
  cpuidle: coupled: fix warnings when compiling with W=1
  cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
  PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
  PM / devfreq: Add debugfs support with devfreq_summary file
  Documentation: admin-guide: PM: Add intel_idle document
  cpuidle: arm: Enable compile testing for some of drivers
  PM-runtime: add tracepoints for usage_count changes
  cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
  PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
  ...

2 parents aae1464 + c102671 commit 6d277ac

65 files changed, 3829 insertions(+), 606 deletions(-)

.mailmap

Lines changed: 1 addition & 0 deletions
@@ -217,6 +217,7 @@ Praveen BP <[email protected]>
217217
218218
219219
220+
220221
Rajesh Shah <[email protected]>
221222
Ralf Baechle <[email protected]>
222223
Ralf Wildenhues <[email protected]>

Documentation/ABI/testing/sysfs-class-devfreq

Lines changed: 14 additions & 4 deletions
@@ -7,6 +7,13 @@ Description:
 The name of devfreq object denoted as ... is same as the
 name of device using devfreq.

+What: /sys/class/devfreq/.../name
+Date: November 2019
+Contact: Chanwoo Choi <[email protected]>
+Description:
+The /sys/class/devfreq/.../name shows the name of device
+of the corresponding devfreq object.
+
 What: /sys/class/devfreq/.../governor
 Date: September 2011
 Contact: MyungJoo Ham <[email protected]>
@@ -48,12 +55,15 @@ What: /sys/class/devfreq/.../trans_stat
 Date: October 2012
 Contact: MyungJoo Ham <[email protected]>
 Description:
-This ABI shows the statistics of devfreq behavior on a
-specific device. It shows the time spent in each state and
-the number of transitions between states.
+This ABI shows or clears the statistics of devfreq behavior
+on a specific device. It shows the time spent in each state
+and the number of transitions between states.
 In order to activate this ABI, the devfreq target device
 driver should provide the list of available frequencies
-with its profile.
+with its profile. If need to reset the statistics of devfreq
+behavior on a specific device, enter 0(zero) to 'trans_stat'
+as following:
+echo 0 > /sys/class/devfreq/.../trans_stat

 What: /sys/class/devfreq/.../userspace/set_freq
 Date: September 2011
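
The reset described in the `trans_stat` entry above can also be done from a script. The following is a minimal sketch in Python, assuming only the documented behavior (writing `0` clears the statistics). `reset_trans_stat` is a hypothetical helper name, and the device name has to be supplied by the caller because the ABI text elides it.

```python
from pathlib import Path

# Root of the devfreq class in sysfs, as named in the ABI text above.
DEVFREQ_ROOT = Path("/sys/class/devfreq")


def reset_trans_stat(device: str, root: Path = DEVFREQ_ROOT) -> None:
    """Clear the transition statistics of one devfreq device.

    Equivalent to `echo 0 > /sys/class/devfreq/<device>/trans_stat`.
    Writing to the real attribute normally requires root privileges.
    """
    (root / device / "trans_stat").write_text("0\n")
```

The `root` parameter exists only so that the helper can be pointed at a scratch directory when testing it without touching sysfs.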

Documentation/ABI/testing/sysfs-devices-system-cpu

Lines changed: 6 additions & 0 deletions
@@ -196,6 +196,12 @@ Description:
 does not reflect it. Likewise, if one enables a deep state but a
 lighter state still is disabled, then this has no effect.

+What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/default_status
+Date: December 2019
+KernelVersion: v5.6
+Contact: Linux power management list <[email protected]>
+Description:
+(RO) The default status of this state, "enabled" or "disabled".

 What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/residency
 Date: March 2014
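
As an illustration of how the new read-only `default_status` attribute might be consumed, here is a small Python sketch that collects it next to the existing `disable` attribute for every `stateN` directory of one CPU. `idle_state_status` is a hypothetical helper; it assumes both attributes are present, which is not the case on kernels older than the one introducing `default_status`.

```python
from pathlib import Path

# Per-CPU sysfs root, as named in the ABI text above.
CPU_ROOT = Path("/sys/devices/system/cpu")


def idle_state_status(cpu: int, root: Path = CPU_ROOT) -> dict:
    """Map each stateN directory name to (default_status, disable).

    Both values are returned as the stripped strings read from sysfs,
    for example ("enabled", "0").
    """
    result = {}
    cpuidle = root / f"cpu{cpu}" / "cpuidle"
    for state in sorted(cpuidle.glob("state[0-9]*")):
        default = (state / "default_status").read_text().strip()
        disabled = (state / "disable").read_text().strip()
        result[state.name] = (default, disabled)
    return result
```

Comparing the two values shows which states user space has changed relative to the driver's default.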

Documentation/ABI/testing/sysfs-power

Lines changed: 13 additions & 0 deletions
@@ -407,3 +407,16 @@ Contact: Kalesh Singh <[email protected]>
 Description:
 The /sys/power/suspend_stats/last_failed_step file contains
 the last failed step in the suspend/resume path.
+
+What: /sys/power/sync_on_suspend
+Date: October 2019
+Contact: Jonas Meurer <[email protected]>
+Description:
+This file controls whether or not the kernel will sync()
+filesystems during system suspend (after freezing user space
+and before suspending devices).
+
+Writing a "1" to this file enables the sync() and writing a "0"
+disables it. Reads from the file return the current value.
+The default is "1" if the build-time "SUSPEND_SKIP_SYNC" config
+flag is unset, or "0" otherwise.
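
The read and write semantics of `sync_on_suspend` described above can be wrapped in two small helpers. This is a sketch only: the function names are hypothetical, and it assumes the attribute returns exactly `0` or `1` followed by a newline, as the description implies.

```python
from pathlib import Path

# Attribute path from the ABI entry above.
SYNC_ON_SUSPEND = Path("/sys/power/sync_on_suspend")


def get_sync_on_suspend(path: Path = SYNC_ON_SUSPEND) -> bool:
    """Return True if the kernel will sync() filesystems on suspend."""
    return path.read_text().strip() == "1"


def set_sync_on_suspend(enabled: bool, path: Path = SYNC_ON_SUSPEND) -> None:
    """Write "1" to enable the sync() or "0" to disable it.

    Writing to the real attribute normally requires root privileges.
    """
    path.write_text("1" if enabled else "0")
```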

Documentation/admin-guide/pm/cpuidle.rst

Lines changed: 3 additions & 0 deletions
@@ -506,6 +506,9 @@ object corresponding to it, as follows:
 ``disable``
 Whether or not this idle state is disabled.

+``default_status``
+The default status of this state, "enabled" or "disabled".
+
 ``latency``
 Exit latency of the idle state in microseconds.

Documentation/admin-guide/pm/intel_idle.rst

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================================
+``intel_idle`` CPU Idle Time Management Driver
+==============================================
+
+:Copyright: |copy| 2020 Intel Corporation
+
+:Author: Rafael J. Wysocki <[email protected]>
+
+
+General Information
+===================
+
+``intel_idle`` is a part of the
+:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
+(``CPUIdle``). It is the default CPU idle time management driver for the
+Nehalem and later generations of Intel processors, but the level of support for
+a particular processor model in it depends on whether or not it recognizes that
+processor model and may also depend on information coming from the platform
+firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
+works in general, so this is the time to get familiar with :doc:`cpuidle` if you
+have not done that yet.]
+
+``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
+logical CPU executing it is idle and so it may be possible to put some of the
+processor's functional blocks into low-power states. That instruction takes two
+arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
+first of which, referred to as a *hint*, can be used by the processor to
+determine what can be done (for details refer to Intel Software Developer’s
+Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in
+which the support for the ``MWAIT`` instruction has been disabled (for example,
+via the platform firmware configuration menu) or which do not support that
+instruction at all.
+
+``intel_idle`` is not modular, so it cannot be unloaded, which means that the
+only way to pass early-configuration-time parameters to it is via the kernel
+command line.
41+
42+
.. _intel-idle-enumeration-of-states:
43+
44+
Enumeration of Idle States
45+
==========================
46+
47+
Each ``MWAIT`` hint value is interpreted by the processor as a license to
48+
reconfigure itself in a certain way in order to save energy. The processor
49+
configurations (with reduced power draw) resulting from that are referred to
50+
as C-states (in the ACPI terminology) or idle states. The list of meaningful
51+
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
52+
processor) corresponding to them depends on the processor model and it may also
53+
depend on the configuration of the platform.
54+
55+
In order to create a list of available idle states required by the ``CPUIdle``
56+
subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
57+
``intel_idle`` can use two sources of information: static tables of idle states
58+
for different processor models included in the driver itself and the ACPI tables
59+
of the system. The former are always used if the processor model at hand is
60+
recognized by ``intel_idle`` and the latter are used if that is required for
61+
the given processor model (which is the case for all server processor models
62+
recognized by ``intel_idle``) or if the processor model is not recognized.
63+
64+
If the ACPI tables are going to be used for building the list of available idle
65+
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
66+
objects corresponding to the CPUs in the system (refer to the ACPI specification
67+
[2]_ for the description of ``_CST`` and its output package). Because the
68+
``CPUIdle`` subsystem expects that the list of idle states supplied by the
69+
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
70+
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
71+
driver looks for the first ``_CST`` object returning at least one valid idle
72+
state description and such that all of the idle states included in its return
73+
package are of the FFH (Functional Fixed Hardware) type, which means that the
74+
``MWAIT`` instruction is expected to be used to tell the processor that it can
75+
enter one of them. The return package of that ``_CST`` is then assumed to be
76+
applicable to all of the other CPUs in the system and the idle state
77+
descriptions extracted from it are stored in a preliminary list of idle states
78+
coming from the ACPI tables. [This step is skipped if ``intel_idle`` is
79+
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]
80+
81+
Next, the first (index 0) entry in the list of available idle states is
82+
initialized to represent a "polling idle state" (a pseudo-idle state in which
83+
the target CPU continuously fetches and executes instructions), and the
84+
subsequent (real) idle state entries are populated as follows.
85+
86+
If the processor model at hand is recognized by ``intel_idle``, there is a
87+
(static) table of idle state descriptions for it in the driver. In that case,
88+
the "internal" table is the primary source of information on idle states and the
89+
information from it is copied to the final list of available idle states. If
90+
using the ACPI tables for the enumeration of idle states is not required
91+
(depending on the processor model), all of the listed idle state are enabled by
92+
default (so all of them will be taken into consideration by ``CPUIdle``
93+
governors during CPU idle state selection). Otherwise, some of the listed idle
94+
states may not be enabled by default if there are no matching entries in the
95+
preliminary list of idle states coming from the ACPI tables. In that case user
96+
space still can enable them later (on a per-CPU basis) with the help of
97+
the ``disable`` idle state attribute in ``sysfs`` (see
98+
:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that
99+
the idle states "known" to the driver may not be enabled by default if they have
100+
not been exposed by the platform firmware (through the ACPI tables).
101+
102+
If the given processor model is not recognized by ``intel_idle``, but it
103+
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
104+
tables is used for building the final list that will be supplied to the
105+
``CPUIdle`` core during driver registration. For each idle state in that list,
106+
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
107+
entry in the final list of idle states. The name of the idle state represented
108+
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
109+
"CX_ACPI", where X is the index of that idle state in the final list (note that
110+
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
111+
its target residency is based on the exit latency value. Specifically, for
112+
C1-type idle states the exit latency value is also used as the target residency
113+
(for compatibility with the majority of the "internal" tables of idle states for
114+
various processor models recognized by ``intel_idle``) and for the other idle
115+
state types (C2 and C3) the target residency value is 3 times the exit latency
116+
(again, that is because it reflects the target residency to exit latency ratio
117+
in the majority of cases for the processor models recognized by ``intel_idle``).
118+
All of the idle states in the final list are enabled by default in this case.
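
The naming and target residency rules for idle states taken from `_CST` (the last paragraph above) can be written out as a short sketch. This is an illustration in Python of the rule as the document states it, not the driver's C code; both function names are hypothetical.

```python
def acpi_state_name(index: int) -> str:
    """Name of the idle state at `index` in the final list ("CX_ACPI").

    Index 0 is reserved for the polling state, so X starts at 1.
    """
    if index < 1:
        raise ValueError("index 0 is reserved for the polling state")
    return f"C{index}_ACPI"


def acpi_target_residency_us(cstate_type: int, exit_latency_us: int) -> int:
    """Target residency derived from the exit latency.

    cstate_type is the ACPI C-state type (1, 2 or 3):
    C1-type states reuse the exit latency, C2 and C3 use 3 times it.
    """
    if cstate_type not in (1, 2, 3):
        raise ValueError("expected ACPI C-state type 1, 2 or 3")
    if cstate_type == 1:
        return exit_latency_us
    return 3 * exit_latency_us
```

For example, a C3-type state with an exit latency of 100 microseconds gets a target residency of 300 microseconds under this rule.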
+
+
+.. _intel-idle-initialization:
+
+Initialization
+==============
+
+The initialization of ``intel_idle`` starts with checking if the kernel command
+line options forbid the use of the ``MWAIT`` instruction. If that is the case,
+an error code is returned right away.
+
+The next step is to check whether or not the processor model is known to the
+driver, which determines the idle states enumeration method (see
+`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
+supports ``MWAIT`` (the initialization fails if that is not the case). Then,
+the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
+driver initialization fails if the level of support is not as expected (for
+example, if the total number of ``MWAIT`` substates returned is 0).
+
+Next, if the driver is not configured to ignore the ACPI tables (see
+`below <intel-idle-parameters_>`_), the idle states information provided by the
+platform firmware is extracted from them.
+
+Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
+available idle states is created as explained
+`above <intel-idle-enumeration-of-states_>`_.
+
+Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
+as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
+for configuring individual CPUs is registered via cpuhp_setup_state(), which
+(among other things) causes the callback routine to be invoked for all of the
+CPUs present in the system at that time (each CPU executes its own instance of
+the callback routine). That routine registers a ``CPUIdle`` device for the CPU
+running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
+optionally performs some CPU-specific initialization actions that may be
+required for the given processor model.
+
+
+.. _intel-idle-parameters:
+
+Kernel Command Line Options and Module Parameters
+=================================================
+
+The *x86* architecture support code recognizes three kernel command line
+options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
+and ``idle=nomwait``. If any of them is present in the kernel command line, the
+``MWAIT`` instruction is not allowed to be used, so the initialization of
+``intel_idle`` will fail.
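
To check from user space whether one of the three options above was passed to the running kernel, the command line can be read back from `/proc/cmdline`. The sketch below is a user-space approximation in Python with hypothetical function names; the kernel performs its own parsing, which this simple token match does not reproduce.

```python
# The three options named in the text above.
MWAIT_BLOCKING_OPTIONS = frozenset({"idle=poll", "idle=halt", "idle=nomwait"})


def cmdline_forbids_mwait(cmdline: str) -> bool:
    """True if any whitespace-separated token is one of the idle= options."""
    return any(token in MWAIT_BLOCKING_OPTIONS for token in cmdline.split())


def running_kernel_forbids_mwait(path: str = "/proc/cmdline") -> bool:
    """Apply the check to the command line of the running kernel."""
    with open(path) as f:
        return cmdline_forbids_mwait(f.read())
```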
+
+Apart from that there are two module parameters recognized by ``intel_idle``
+itself that can be set via the kernel command line (they cannot be updated via
+sysfs, so that is the only way to change their values).
+
+The ``max_cstate`` parameter value is the maximum idle state index in the list
+of idle states supplied to the ``CPUIdle`` core during the registration of the
+driver. It is also the maximum number of regular (non-polling) idle states that
+can be used by ``intel_idle``, so the enumeration of idle states is terminated
+after finding that number of usable idle states (the other idle states that
+potentially might have been used if ``max_cstate`` had been greater are not
+taken into consideration at all). Setting ``max_cstate`` can prevent
+``intel_idle`` from exposing idle states that are regarded as "too deep" for
+some reason to the ``CPUIdle`` core, but it does so by making them effectively
+invisible until the system is shut down and started again which may not always
+be desirable. In practice, it is only really necessary to do that if the idle
+states in question cannot be enabled during system startup, because in the
+working state of the system the CPU power management quality of service (PM
+QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
+even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
+Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
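
The effect of `max_cstate` on the list of idle states can be modeled in a few lines. This Python sketch only mirrors the description above (index 0 is the polling state, `max_cstate` is the highest index kept, and 0 makes initialization fail); `apply_max_cstate` is a hypothetical name and the state names in the usage note are placeholders.

```python
def apply_max_cstate(states: list, max_cstate: int) -> list:
    """Return the list of idle states that would reach the CPUIdle core.

    states[0] is the polling state and the rest are regular idle states
    in enumeration order. At most `max_cstate` regular states are kept,
    so `max_cstate` is also the highest index in the result.
    """
    if max_cstate < 0:
        raise ValueError("max_cstate must not be negative")
    if max_cstate == 0:
        raise RuntimeError("intel_idle initialization fails for max_cstate=0")
    return states[: max_cstate + 1]
```

With the placeholder list `["POLL", "C1", "C1E", "C6"]` and `max_cstate=2`, the states kept are `["POLL", "C1", "C1E"]`; the deeper state is never enumerated and stays invisible until the next boot.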
+
+The ``noacpi`` module parameter (which is recognized by ``intel_idle`` if the
+kernel has been configured with ACPI support), can be set to make the driver
+ignore the system's ACPI tables entirely (it is unset by default).
+
+
+.. _intel-idle-core-and-package-idle-states:
+
+Core and Package Levels of Idle States
+======================================
+
+Typically, in a processor supporting the ``MWAIT`` instruction there are (at
+least) two levels of idle states (or C-states). One level, referred to as
+"core C-states", covers individual cores in the processor, whereas the other
+level, referred to as "package C-states", covers the entire processor package
+and it may also involve other components of the system (GPUs, memory
+controllers, I/O hubs etc.).
+
+Some of the ``MWAIT`` hint values allow the processor to use core C-states only
+(most importantly, that is the case for the ``MWAIT`` hint value corresponding
+to the ``C1`` idle state), but the majority of them give it a license to put
+the target core (i.e. the core containing the logical CPU executing ``MWAIT``
+with the given hint value) into a specific core C-state and then (if possible)
+to enter a specific package C-state at the deeper level. For example, the
+``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
+put the target core into the low-power state referred to as "core ``C3``" (or
+``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
+have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
+representing a deeper idle state), and in addition to that (in the majority of
+cases) it gives the processor a license to put the entire package (possibly
+including some non-CPU components such as a GPU or a memory controller) into the
+low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
+all of the cores have gone into the ``CC3`` state and (possibly) some additional
+conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
+be required to be in a certain GPU-specific low-power state for ``PC3`` to be
+reachable).
+
+As a rule, there is no simple way to make the processor use core C-states only
+if the conditions for entering the corresponding package C-states are met, so
+the logical CPU executing ``MWAIT`` with a hint value that is not core-level
+only (like for ``C1``) must always assume that this may cause the processor to
+enter a package C-state. [That is why the exit latency and target residency
+values corresponding to the majority of ``MWAIT`` hint values in the "internal"
+tables of idle states in ``intel_idle`` reflect the properties of package
+C-states.] If using package C-states is not desirable at all, either
+:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
+``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
+restrict the range of permissible idle states to the ones with core-level only
+``MWAIT`` hint values (like ``C1``).
238+
239+
References
240+
==========
241+
242+
.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
243+
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html
244+
245+
.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
246+
https://uefi.org/specifications

Documentation/admin-guide/pm/working-state.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Working-State Power Management
 :maxdepth: 2

 cpuidle
+intel_idle
 cpufreq
 intel_pstate
 intel_epb
