| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +.. include:: <isonum.txt> |
| 3 | + |
| 4 | +============================================== |
| 5 | +``intel_idle`` CPU Idle Time Management Driver |
| 6 | +============================================== |
| 7 | + |
| 8 | +:Copyright: |copy| 2020 Intel Corporation |
| 9 | + |
| 10 | +:Author: Rafael J. Wysocki < [email protected]> |
| 11 | + |
| 12 | + |
| 13 | +General Information |
| 14 | +=================== |
| 15 | + |
| 16 | +``intel_idle`` is a part of the |
| 17 | +:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel |
| 18 | +(``CPUIdle``). It is the default CPU idle time management driver for the |
| 19 | +Nehalem and later generations of Intel processors, but the level of support for |
| 20 | +a particular processor model in it depends on whether or not it recognizes that |
| 21 | +processor model and may also depend on information coming from the platform |
| 22 | +firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle`` |
| 23 | +works in general, so this is the time to get familiar with :doc:`cpuidle` if you |
| 24 | +have not done that yet.] |
| 25 | + |
| 26 | +``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the |
| 27 | +logical CPU executing it is idle and so it may be possible to put some of the |
| 28 | +processor's functional blocks into low-power states. That instruction takes two |
| 29 | +arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the |
| 30 | +first of which, referred to as a *hint*, can be used by the processor to |
| 31 | +determine what can be done (for details refer to Intel Software Developer’s |
| 32 | +Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in |
| 33 | +which the support for the ``MWAIT`` instruction has been disabled (for example, |
| 34 | +via the platform firmware configuration menu) or which do not support that |
| 35 | +instruction at all. |
| 36 | + |
| 37 | +``intel_idle`` is not modular, so it cannot be unloaded, which means that the |
| 38 | +only way to pass early-configuration-time parameters to it is via the kernel |
| 39 | +command line. |
| 40 | + |
| 41 | + |
| 42 | +.. _intel-idle-enumeration-of-states: |
| 43 | + |
| 44 | +Enumeration of Idle States |
| 45 | +========================== |
| 46 | + |
| 47 | +Each ``MWAIT`` hint value is interpreted by the processor as a license to |
| 48 | +reconfigure itself in a certain way in order to save energy. The processor |
| 49 | +configurations (with reduced power draw) resulting from that are referred to |
| 50 | +as C-states (in the ACPI terminology) or idle states. The list of meaningful |
| 51 | +``MWAIT`` hint values and idle states (i.e. low-power configurations of the |
| 52 | +processor) corresponding to them depends on the processor model and it may also |
| 53 | +depend on the configuration of the platform. |
| 54 | + |
| 55 | +In order to create a list of available idle states required by the ``CPUIdle`` |
| 56 | +subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`), |
| 57 | +``intel_idle`` can use two sources of information: static tables of idle states |
| 58 | +for different processor models included in the driver itself and the ACPI tables |
| 59 | +of the system. The former are always used if the processor model at hand is |
| 60 | +recognized by ``intel_idle`` and the latter are used if that is required for |
| 61 | +the given processor model (which is the case for all server processor models |
| 62 | +recognized by ``intel_idle``) or if the processor model is not recognized. |
| 63 | + |
| 64 | +If the ACPI tables are going to be used for building the list of available idle |
| 65 | +states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI |
| 66 | +objects corresponding to the CPUs in the system (refer to the ACPI specification |
| 67 | +[2]_ for the description of ``_CST`` and its output package). Because the |
| 68 | +``CPUIdle`` subsystem expects that the list of idle states supplied by the |
| 69 | +driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is |
| 70 | +registered as the ``CPUIdle`` driver for all of the CPUs in the system, the |
| 71 | +driver looks for the first ``_CST`` object returning at least one valid idle |
| 72 | +state description and such that all of the idle states included in its return |
| 73 | +package are of the FFH (Functional Fixed Hardware) type, which means that the |
| 74 | +``MWAIT`` instruction is expected to be used to tell the processor that it can |
| 75 | +enter one of them. The return package of that ``_CST`` is then assumed to be |
| 76 | +applicable to all of the other CPUs in the system and the idle state |
| 77 | +descriptions extracted from it are stored in a preliminary list of idle states |
| 78 | +coming from the ACPI tables. [This step is skipped if ``intel_idle`` is |
| 79 | +configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.] |
| 80 | + |
| 81 | +Next, the first (index 0) entry in the list of available idle states is |
| 82 | +initialized to represent a "polling idle state" (a pseudo-idle state in which |
| 83 | +the target CPU continuously fetches and executes instructions), and the |
| 84 | +subsequent (real) idle state entries are populated as follows. |
| 85 | + |
| 86 | +If the processor model at hand is recognized by ``intel_idle``, there is a |
| 87 | +(static) table of idle state descriptions for it in the driver. In that case, |
| 88 | +the "internal" table is the primary source of information on idle states and the |
| 89 | +information from it is copied to the final list of available idle states. If |
| 90 | +using the ACPI tables for the enumeration of idle states is not required |
| 91 | +(depending on the processor model), all of the listed idle states are enabled by |
| 92 | +default (so all of them will be taken into consideration by ``CPUIdle`` |
| 93 | +governors during CPU idle state selection). Otherwise, some of the listed idle |
| 94 | +states may not be enabled by default if there are no matching entries in the |
| 95 | +preliminary list of idle states coming from the ACPI tables. In that case user |
| 96 | +space can still enable them later (on a per-CPU basis) with the help of |
| 97 | +the ``disable`` idle state attribute in ``sysfs`` (see |
| 98 | +:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that |
| 99 | +the idle states "known" to the driver may not be enabled by default if they have |
| 100 | +not been exposed by the platform firmware (through the ACPI tables). |
| 101 | + |
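The default enable status described above can be illustrated with a minimal sketch. It is not the driver's code: the ``IdleState`` type, the ``apply_acpi_defaults()`` helper and the idea of matching entries by ``MWAIT`` hint value are assumptions made for illustration only, and the hint values are sample data.

```python
from dataclasses import dataclass


@dataclass
class IdleState:
    """Hypothetical stand-in for one entry of an "internal" table."""
    name: str
    mwait_hint: int
    enabled: bool = True


def apply_acpi_defaults(internal_table, acpi_hints, use_acpi):
    """Return copies of the internal entries with their default status set.

    If the ACPI tables are not required for this processor model, every
    state stays enabled.  Otherwise a state is enabled by default only if
    a matching entry (assumed here: same MWAIT hint) came from ACPI.
    """
    result = []
    for state in internal_table:
        enabled = (not use_acpi) or (state.mwait_hint in acpi_hints)
        result.append(IdleState(state.name, state.mwait_hint, enabled))
    return result


# Sample data: the firmware exposes only the first state.
table = [IdleState("C1", 0x00), IdleState("C6", 0x20)]
for state in apply_acpi_defaults(table, acpi_hints={0x00}, use_acpi=True):
    print(state.name, "enabled" if state.enabled else "disabled by default")
```

A state left disabled this way is still present in the list, so user space can switch it on later through the per-CPU ``disable`` attribute in ``sysfs``, as described above.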
| 102 | +If the given processor model is not recognized by ``intel_idle``, but it |
| 103 | +supports ``MWAIT``, the preliminary list of idle states coming from the ACPI |
| 104 | +tables is used for building the final list that will be supplied to the |
| 105 | +``CPUIdle`` core during driver registration. For each idle state in that list, |
| 106 | +the description, ``MWAIT`` hint and exit latency are copied to the corresponding |
| 107 | +entry in the final list of idle states. The name of the idle state represented |
| 108 | +by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is |
| 109 | +"CX_ACPI", where X is the index of that idle state in the final list (note that |
| 110 | +the minimum value of X is 1, because 0 is reserved for the "polling" state), and |
| 111 | +its target residency is based on the exit latency value. Specifically, for |
| 112 | +C1-type idle states the exit latency value is also used as the target residency |
| 113 | +(for compatibility with the majority of the "internal" tables of idle states for |
| 114 | +various processor models recognized by ``intel_idle``) and for the other idle |
| 115 | +state types (C2 and C3) the target residency value is 3 times the exit latency |
| 116 | +(again, that is because it reflects the target residency to exit latency ratio |
| 117 | +in the majority of cases for the processor models recognized by ``intel_idle``). |
| 118 | +All of the idle states in the final list are enabled by default in this case. |
| 119 | + |
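The naming and target residency rules for the ACPI-only case can be sketched as follows. The ``AcpiCState`` and ``IdleState`` types and the ``build_from_acpi()`` helper are hypothetical, the latencies are assumed to be in microseconds, and the polling state at index 0 is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class AcpiCState:
    """Hypothetical view of one _CST entry of the FFH type."""
    acpi_type: int          # 1, 2 or 3 for C1-, C2- or C3-type states
    desc: str
    mwait_hint: int
    exit_latency_us: int


@dataclass
class IdleState:
    name: str
    desc: str
    mwait_hint: int
    exit_latency_us: int
    target_residency_us: int


def build_from_acpi(acpi_states):
    """Build the regular (non-polling) entries of the final list."""
    final = []
    # X starts at 1 because index 0 is reserved for the polling state.
    for index, cx in enumerate(acpi_states, start=1):
        # C1-type: residency equals latency; C2/C3-type: 3 times latency.
        multiplier = 1 if cx.acpi_type == 1 else 3
        final.append(IdleState(
            name=f"C{index}_ACPI",
            desc=cx.desc,
            mwait_hint=cx.mwait_hint,
            exit_latency_us=cx.exit_latency_us,
            target_residency_us=cx.exit_latency_us * multiplier,
        ))
    return final


# Sample data only; real values come from the platform firmware.
sample = [AcpiCState(1, "sample C1", 0x00, 2),
          AcpiCState(2, "sample C2", 0x20, 50)]
for state in build_from_acpi(sample):
    print(state.name, state.exit_latency_us, state.target_residency_us)
```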
| 120 | + |
| 121 | +.. _intel-idle-initialization: |
| 122 | + |
| 123 | +Initialization |
| 124 | +============== |
| 125 | + |
| 126 | +The initialization of ``intel_idle`` starts with checking if the kernel command |
| 127 | +line options forbid the use of the ``MWAIT`` instruction. If that is the case, |
| 128 | +an error code is returned right away. |
| 129 | + |
| 130 | +The next step is to check whether or not the processor model is known to the |
| 131 | +driver, which determines the idle states enumeration method (see |
| 132 | +`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor |
| 133 | +supports ``MWAIT`` (the initialization fails if that is not the case). Then, |
| 134 | +the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the |
| 135 | +driver initialization fails if the level of support is not as expected (for |
| 136 | +example, if the total number of ``MWAIT`` substates returned is 0). |
| 137 | + |
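As an illustration of the substate check mentioned above, the sketch below decodes a register value into per-C-state substate counts. It assumes the layout documented for ``CPUID`` leaf 5 in the Intel manual (4-bit fields in ``EDX``, one per C-state); the helper names are hypothetical and the register value is passed in as a plain integer rather than read from hardware.

```python
def mwait_substates(edx):
    """Split an EDX value into eight 4-bit substate counts (C0 first)."""
    return [(edx >> (4 * i)) & 0xF for i in range(8)]


def mwait_substates_ok(edx):
    """Mirror the example failure condition: no substates reported at all."""
    return sum(mwait_substates(edx)) > 0


# Sample value only: 0 substates for C0, 2 for C1, 1 for C2, 1 for C3.
print(mwait_substates(0x1120)[:4])
print(mwait_substates_ok(0x0))
```

The real driver performs further checks on the ``CPUID`` output, so this covers only the one condition given as an example in the text.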
| 138 | +Next, if the driver is not configured to ignore the ACPI tables (see |
| 139 | +`below <intel-idle-parameters_>`_), the idle states information provided by the |
| 140 | +platform firmware is extracted from them. |
| 141 | + |
| 142 | +Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of |
| 143 | +available idle states is created as explained |
| 144 | +`above <intel-idle-enumeration-of-states_>`_. |
| 145 | + |
| 146 | +Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver() |
| 147 | +as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback |
| 148 | +for configuring individual CPUs is registered via cpuhp_setup_state(), which |
| 149 | +(among other things) causes the callback routine to be invoked for all of the |
| 150 | +CPUs present in the system at that time (each CPU executes its own instance of |
| 151 | +the callback routine). That routine registers a ``CPUIdle`` device for the CPU |
| 152 | +running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and |
| 153 | +optionally performs some CPU-specific initialization actions that may be |
| 154 | +required for the given processor model. |
| 155 | + |
| 156 | + |
| 157 | +.. _intel-idle-parameters: |
| 158 | + |
| 159 | +Kernel Command Line Options and Module Parameters |
| 160 | +================================================= |
| 161 | + |
| 162 | +The *x86* architecture support code recognizes three kernel command line |
| 163 | +options related to CPU idle time management: ``idle=poll``, ``idle=halt``, |
| 164 | +and ``idle=nomwait``. If any of them is present in the kernel command line, the |
| 165 | +``MWAIT`` instruction is not allowed to be used, so the initialization of |
| 166 | +``intel_idle`` will fail. |
| 167 | + |
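A small user-space sketch can show which command lines would prevent ``intel_idle`` from initializing according to the rule above. The ``mwait_forbidden()`` helper is hypothetical and only looks for the three options named in the text.

```python
# The three options that, per the text above, disallow MWAIT.
MWAIT_BLOCKING_OPTIONS = {"idle=poll", "idle=halt", "idle=nomwait"}


def mwait_forbidden(cmdline):
    """Return True if any of the three options appears on the command line."""
    return any(token in MWAIT_BLOCKING_OPTIONS for token in cmdline.split())


print(mwait_forbidden("root=/dev/sda1 idle=nomwait quiet"))
print(mwait_forbidden("root=/dev/sda1 quiet"))
```

On a running system the string to check could be taken from ``/proc/cmdline``.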
| 168 | +Apart from that there are two module parameters recognized by ``intel_idle`` |
| 169 | +itself that can be set via the kernel command line (they cannot be updated via |
| 170 | +``sysfs``, so that is the only way to change their values). |
| 171 | + |
| 172 | +The ``max_cstate`` parameter value is the maximum idle state index in the list |
| 173 | +of idle states supplied to the ``CPUIdle`` core during the registration of the |
| 174 | +driver. It is also the maximum number of regular (non-polling) idle states that |
| 175 | +can be used by ``intel_idle``, so the enumeration of idle states is terminated |
| 176 | +after finding that number of usable idle states (the other idle states that |
| 177 | +potentially might have been used if ``max_cstate`` had been greater are not |
| 178 | +taken into consideration at all). Setting ``max_cstate`` can prevent |
| 179 | +``intel_idle`` from exposing idle states that are regarded as "too deep" for |
| 180 | +some reason to the ``CPUIdle`` core, but it does so by making them effectively |
| 181 | +invisible until the system is shut down and started again, which may not always |
| 182 | +be desirable. In practice, it is only really necessary to do that if the idle |
| 183 | +states in question cannot be enabled during system startup, because in the |
| 184 | +working state of the system the CPU power management quality of service (PM |
| 185 | +QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states |
| 186 | +even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`). |
| 187 | +Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. |
| 188 | + |
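The effect of ``max_cstate`` on the list of idle states can be sketched as below. The ``limit_states()`` helper is hypothetical, the state names are sample data, and the sketch assumes that every listed state is usable, so it simply truncates the list.

```python
def limit_states(states, max_cstate):
    """Keep the polling state plus at most max_cstate regular states.

    states[0] is assumed to be the polling state, so max_cstate is also
    the highest index that can appear in the returned list.
    """
    if max_cstate == 0:
        # Per the text, this makes the driver initialization fail.
        raise ValueError("max_cstate=0: initialization fails")
    return states[: max_cstate + 1]


print(limit_states(["POLL", "C1", "C1E", "C6"], 2))
```

Since ``intel_idle`` is not modular, the parameter would be given on the kernel command line, for example as ``intel_idle.max_cstate=2``.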
| 189 | +The ``noacpi`` module parameter (which is recognized by ``intel_idle`` if the |
| 190 | +kernel has been configured with ACPI support) can be set to make the driver |
| 191 | +ignore the system's ACPI tables entirely (it is unset by default). |
| 192 | + |
| 193 | + |
| 194 | +.. _intel-idle-core-and-package-idle-states: |
| 195 | + |
| 196 | +Core and Package Levels of Idle States |
| 197 | +====================================== |
| 198 | + |
| 199 | +Typically, in a processor supporting the ``MWAIT`` instruction there are (at |
| 200 | +least) two levels of idle states (or C-states). One level, referred to as |
| 201 | +"core C-states", covers individual cores in the processor, whereas the other |
| 202 | +level, referred to as "package C-states", covers the entire processor package |
| 203 | +and it may also involve other components of the system (GPUs, memory |
| 204 | +controllers, I/O hubs etc.). |
| 205 | + |
| 206 | +Some of the ``MWAIT`` hint values allow the processor to use core C-states only |
| 207 | +(most importantly, that is the case for the ``MWAIT`` hint value corresponding |
| 208 | +to the ``C1`` idle state), but the majority of them give it a license to put |
| 209 | +the target core (i.e. the core containing the logical CPU executing ``MWAIT`` |
| 210 | +with the given hint value) into a specific core C-state and then (if possible) |
| 211 | +to enter a specific package C-state at the deeper level. For example, the |
| 212 | +``MWAIT`` hint value representing the ``C3`` idle state allows the processor to |
| 213 | +put the target core into the low-power state referred to as "core ``C3``" (or |
| 214 | +``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core |
| 215 | +have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value |
| 216 | +representing a deeper idle state), and in addition to that (in the majority of |
| 217 | +cases) it gives the processor a license to put the entire package (possibly |
| 218 | +including some non-CPU components such as a GPU or a memory controller) into the |
| 219 | +low-power state referred to as "package ``C3``" (or ``PC3``), which happens if |
| 220 | +all of the cores have gone into the ``CC3`` state and (possibly) some additional |
| 221 | +conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may |
| 222 | +be required to be in a certain GPU-specific low-power state for ``PC3`` to be |
| 223 | +reachable). |
| 224 | + |
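The ``C3`` example above can be modeled with a toy sketch. It is a simplification under stated assumptions: requests are represented as plain integers (1 for ``C1``, 3 for ``C3``, 6 for a deeper state), a core whose siblings all asked for ``C3`` or deeper is treated as satisfying the ``CC3`` condition, and any platform-specific extra conditions are collapsed into one flag. The function names are hypothetical.

```python
C3 = 3  # illustrative depth value for the C3 hint


def core_may_enter_cc3(sibling_requests):
    """All logical CPUs (SMT siblings) asked for C3 or a deeper state."""
    return all(depth >= C3 for depth in sibling_requests)


def package_may_enter_pc3(cores, extra_conditions_met=True):
    """All cores satisfy the CC3 condition and the extra conditions hold."""
    return extra_conditions_met and all(
        core_may_enter_cc3(siblings) for siblings in cores
    )


# Two cores with two SMT siblings each.
print(package_may_enter_pc3([[3, 6], [3, 3]]))   # every sibling at C3 or deeper
print(package_may_enter_pc3([[3, 1], [3, 3]]))   # one sibling only asked for C1
```

In the second call a single logical CPU requesting only ``C1`` keeps its core out of ``CC3``, which in turn keeps the whole package out of ``PC3``.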
| 225 | +As a rule, there is no simple way to make the processor use core C-states only |
| 226 | +if the conditions for entering the corresponding package C-states are met, so |
| 227 | +the logical CPU executing ``MWAIT`` with a hint value that is not core-level |
| 228 | +only (like for ``C1``) must always assume that this may cause the processor to |
| 229 | +enter a package C-state. [That is why the exit latency and target residency |
| 230 | +values corresponding to the majority of ``MWAIT`` hint values in the "internal" |
| 231 | +tables of idle states in ``intel_idle`` reflect the properties of package |
| 232 | +C-states.] If using package C-states is not desirable at all, either |
| 233 | +:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of |
| 234 | +``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to |
| 235 | +restrict the range of permissible idle states to the ones with core-level only |
| 236 | +``MWAIT`` hint values (like ``C1``). |
| 237 | + |
| 238 | + |
| 239 | +References |
| 240 | +========== |
| 241 | + |
| 242 | +.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*, |
| 243 | + https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html |
| 244 | + |
| 245 | +.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*, |
| 246 | + https://uefi.org/specifications |