TSC, APERF, and MPERF Counters

Overview

This article explores TSC, APERF, and MPERF - three important counters in Intel CPUs. The aim is to clarify their operation and practical applications in Linux power management and performance tools.

The following counters are covered:

  • TSC (Time Stamp Counter), accessible via the RDTSC instruction
  • APERF: MSR 0xE8 (IA32_APERF)
  • MPERF: MSR 0xE7 (IA32_MPERF)

Disclaimer

This article is based on publicly available Intel documentation, open-source Linux code, and practical experience. While care has been taken to ensure accuracy, the information may not be complete or current. Readers should consult official Intel documentation for authoritative specifications.

Clocks

Before introducing the counters, let's discuss the relevant clock sources in Intel CPUs.

Crystal Clock

The crystal clock is generated by a physical crystal oscillator and provides the timing reference on modern Intel processors. The crystal clock frequency typically ranges from 24 MHz to 38.4 MHz depending on the platform. On modern Intel client platforms, it is 38.4 MHz.

Some platforms expose the crystal clock frequency through CPUID.15H.ECX[31:0].

Bus Clock

The bus clock is a clock signal that provides an operational timing reference for CPU cores and other components. On modern Intel processors, the bus clock is typically derived from the crystal clock and operates at approximately 100 MHz.

Unlike the crystal clock, the bus clock frequency varies intentionally to support spread spectrum clocking, which reduces electromagnetic interference by spreading emissions across a wider frequency range.

Due to its variable nature, the bus clock is less suitable than the crystal clock for timing calculations that require high precision. On modern Intel platforms, the bus clock frequency can be determined using CPUID.16H. It is usually reported as 100 MHz, but this is an approximate value. The actual bus clock frequency may vary slightly due to spread spectrum techniques.

TSC

The TSC is a 64-bit counter accessible via the RDTSC instruction. It provides a monotonically increasing timestamp that software uses for performance measurement and timekeeping. TSC effectively measures elapsed time at a constant frequency.

On most modern Intel platforms, Linux uses TSC as the default clock source. Linux's high-resolution timer subsystem uses the CPU's LAPIC controller in TSC deadline mode, which arms a timer interrupt by specifying the target TSC value at which the interrupt should fire.

Legacy TSC

In very old Intel processors, TSC stopped during C-states and scaled with CPU frequency changes. In later processors, TSC continued running in C-states but still scaled with frequency. This behavior made TSC impractical for time measurements since dynamic P-state transitions would affect the counter rate.

Invariant TSC

Modern Intel processors implement Invariant TSC, which increments at a constant rate regardless of P-state or C-state transitions, making it suitable for precise timing measurements even when the CPU frequency varies for power management reasons.

This article discusses only modern Intel platforms with Invariant TSC.

ART

The Always Running Timer (ART) serves as the master timing reference for TSC, operating at crystal clock frequency and continuing to run even during CPU low-power states.

ART can be conceptualized as a global counter that continuously increments and provides the reference for per-CPU TSC counters, which act as local derivatives of this global timing base.

From a software engineering perspective, ART is an unnecessary abstraction layer. It is sufficient to understand the crystal clock frequency in order to understand TSC. However, ART is mentioned in the Intel Software Developer's Manual, so it is included here for completeness.

I think about it this way: The crystal clock serves many purposes beyond TSC. ART is specifically for TSC.

The Intel Software Developer's Manual provides the following formula for TSC:

TSC = P × ART + K

Where:

  • P: Ratio between TSC rate and crystal clock rate
  • ART: Always Running Timer value
  • K: Combined offset: IA32_TSC_ADJUST + VMX TSC offset
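As a rough illustration of the formula above, here is a minimal Python sketch. It assumes that P is the EBX/EAX ratio reported by CPUID.15H, and that the result is truncated to 64 bits like the TSC itself. The function and parameter names are hypothetical, chosen only for this example.

```python
def art_to_tsc(art: int, numerator: int, denominator: int, k: int = 0) -> int:
    """Sketch of TSC = P * ART + K, with P = numerator / denominator.

    Assumption: numerator and denominator are CPUID.15H EBX and EAX.
    """
    if denominator == 0:
        raise ValueError("TSC/crystal ratio denominator must be non-zero")
    # Multiply before dividing to avoid losing precision, then add the offset.
    tsc = (art * numerator) // denominator + k
    # TSC is a 64-bit counter, so keep only the low 64 bits.
    return tsc & 0xFFFFFFFFFFFFFFFF
```

For example, with a ratio of 176/2 and no offset, an ART value of 1,000,000 corresponds to a TSC value of 88,000,000.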

TSC Frequency

Many modern Intel platforms expose the TSC frequency through CPUID.15H:

TSC_frequency = ECX × (EBX/EAX)
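The calculation can be sketched in Python as follows. The function name is hypothetical, and the register values are assumed to have been obtained elsewhere (Python cannot execute CPUID directly). A zero in any register is treated as "not enumerated".

```python
from typing import Optional


def tsc_hz_from_cpuid15(eax: int, ebx: int, ecx: int) -> Optional[int]:
    """Return the TSC frequency in Hz from CPUID.15H register values.

    eax: denominator of the TSC/crystal clock ratio
    ebx: numerator of the TSC/crystal clock ratio
    ecx: crystal clock frequency in Hz
    Returns None if the leaf does not enumerate the needed values.
    """
    if eax == 0 or ebx == 0 or ecx == 0:
        return None
    # TSC_frequency = ECX * (EBX / EAX), using integer arithmetic.
    return ecx * ebx // eax
```

For example, a 38.4 MHz crystal with a ratio of 176/2 gives `tsc_hz_from_cpuid15(2, 176, 38_400_000)`, which is 3,379,200,000 Hz.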

If CPUID.15H is not supported, TSC frequency can be measured using HPET as a reference:

TSC_frequency = (TSC_end - TSC_start) / time_elapsed

APERF/MPERF

APERF and MPERF are Model-Specific Registers (MSRs) that provide insights into CPU performance and frequency behavior.

A helpful mnemonic for remembering their roles:

  • APERF: Actual Performance: Scales with actual CPU frequency.
  • MPERF: Marketing Performance: Increments at a constant frequency, usually the same as the "marketing" (base) CPU frequency.

The "Marketing Performance" mnemonic for MPERF is informal but useful. It was coined by Len Brown as an easy way to distinguish between the two counters.

From Intel SDM

Here is how the Intel Software Developer's Manual describes the APERF and MPERF counters as of November 2025.

  • The IA32_MPERF MSR (0xE7) increments in proportion to a fixed frequency, which is configured when the processor is booted.
  • The IA32_APERF MSR (0xE8) increments in proportion to actual performance, while accounting for hardware coordination of P-state and TM1/TM2; or software initiated throttling.
  • The MSRs are per logical processor; they measure performance only when the targeted processor is in the C0 state.
  • Only the IA32_APERF/IA32_MPERF ratio is architecturally defined; software should not attach meaning to the content of the individual bits of the IA32_APERF or IA32_MPERF MSRs.

And this one is mentioned separately:

  • By default, the IA32_MPERF counter counts during forced idle periods as if the logical processor was active. The IA32_APERF counter does not count during forced idle state. This counting convention allows the OS to compute the average effective frequency of the Logical Processor between the last MWAIT exit and the next MWAIT entry (OS visible C0) by ΔACNT/ΔMCNT * TSC Frequency.

Summary: Here is my interpretation of the key points about APERF and MPERF counters from Intel SDM:

  • APERF and MPERF are 64-bit counters, per-logical CPU.
  • APERF and MPERF increment only in the C0 state.
    • If a logical CPU runs HLT or requests a C-state via MWAIT, both APERF and MPERF stop incrementing.
  • MPERF increments at a constant rate, regardless of actual CPU frequency.
  • APERF increments at a rate proportional to actual CPU frequency.
  • There is no architecturally defined meaning for APERF and MPERF, other than their ratio.
  • APERF/MPERF ratio can be used to measure CPU performance and frequency.
  • MPERF continues counting during "forced idle" (duty cycles), while APERF does not.

Note: Duty cycles are hardware-injected idle CPU periods that limit power consumption under conditions like thermal constraints. Software can also request duty cycle injection. See Intel SDM for details.
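For readers who want to look at the raw counters, below is a minimal Python sketch that reads them through the Linux msr driver. It assumes the driver is loaded (for example, via `modprobe msr`) and that the process has sufficient privileges. The driver exposes each CPU's MSRs as a file in which the MSR address is the read offset. The helper names and the `path_template` parameter are hypothetical and exist only for this illustration.

```python
import os
import struct

MSR_MPERF = 0xE7  # IA32_MPERF
MSR_APERF = 0xE8  # IA32_APERF


def read_msr(cpu: int, msr: int, path_template: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read a 64-bit MSR value for the given logical CPU."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The file offset selects the MSR address; each MSR is 8 bytes.
        data = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read of MSR {msr:#x} on CPU {cpu}")
    return struct.unpack("<Q", data)[0]


def counter_delta(start: int, end: int) -> int:
    """Difference between two samples of a 64-bit counter, tolerating wraparound."""
    return (end - start) & 0xFFFFFFFFFFFFFFFF


# Usage sketch (requires root and the msr driver):
#   a0, m0 = read_msr(0, MSR_APERF), read_msr(0, MSR_MPERF)
#   ... wait for the measurement interval ...
#   a1, m1 = read_msr(0, MSR_APERF), read_msr(0, MSR_MPERF)
#   ratio = counter_delta(a0, a1) / counter_delta(m0, m1)
```

Since only the APERF/MPERF ratio is architecturally defined, the sketch works with deltas between two samples rather than interpreting the absolute values.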

The rest of the APERF/MPERF description is based on my experience and source code analysis of the open-source Linux codebase.

MPERF

TSC and MPERF frequencies can be easily measured in Linux by sampling counters over a known time interval. For example, this perf command takes samples every second on CPU 0:

# Sample every second while running a busy loop on CPU 0 to prevent idle states. Use counter
# grouping ("{}") to ensure that TSC and MPERF are sampled in a single system call.
perf stat -I 1000 -C 0 -A -e '{msr/tsc/,msr/mperf/}' -- taskset -c 0 sh -c 'while true; do :; done'

In practice, the measured MPERF frequency is very close to TSC frequency on all platforms I have worked with. The difference between TSC and MPERF frequencies I observed has been less than 1%.

This was true on both server and hybrid client platforms (for both E-cores and P-cores). But Intel SDM does not guarantee this behavior, so it may not hold true in the future.

APERF

Similarly, APERF frequency can be measured in Linux using the perf tool while pinning CPU frequency to various values. It turns out that APERF increments with a frequency very close to the actual CPU frequency.

If you let the CPU frequency scale freely, APERF frequency will be an average CPU frequency over the measurement interval.

This was true on both server and hybrid client platforms (for both E-cores and P-cores) I have worked with. However, Intel SDM does not guarantee this behavior, so it may not hold true in the future.

The Reference Clock

While TSC operates based on the crystal clock frequency, CPU frequency, along with the APERF and MPERF counters, is based on the bus clock. The bus clock frequency varies to reduce electromagnetic interference, which explains why MPERF frequency is close to TSC frequency, but not identical.

Usage

This section discusses some practical applications of TSC, APERF, and MPERF counters, with examples from the open-source Linux turbostat tool developed by Len Brown.

Busy%

Turbostat's "Busy%" metric indicates the percentage of time the CPU spent in C0 state. It is calculated as:

Busy% = (ΔMPERF / ΔTSC) × 100%

Where:

  • ΔMPERF = Change in MPERF counter over the measurement interval
  • ΔTSC = Change in TSC counter over the measurement interval

Indeed, since MPERF counts only in C0 state, while TSC counts all the time, the ratio of MPERF to TSC gives the fraction of time spent in C0 state.

Because MPERF and TSC are based on different clocks (bus clock vs crystal clock), they run at slightly different rates. Therefore, the ratio is not precisely accurate, but good enough for practical purposes.

Note that Intel SDM explicitly states that there is no architecturally defined meaning for MPERF other than its ratio with APERF. So this calculation is based on practical experience rather than official documentation, and it may not hold true in the future. That said, it has been working reliably for many years on a variety of Intel platforms (all platforms I've worked with). The tool is open source, and if there were a problem with this calculation, someone would probably have reported it by now. If this ever breaks, a possible solution could be to use the CPU_CLK_UNHALTED.REF_TSC performance counter instead of MPERF.
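The Busy% formula can be sketched in Python as follows. This is an illustration of the arithmetic only, not turbostat's actual code, and the function name is hypothetical.

```python
def busy_percent(delta_mperf: int, delta_tsc: int) -> float:
    """Busy% = (delta MPERF / delta TSC) * 100.

    Both deltas must cover the same measurement interval on the same CPU.
    """
    if delta_tsc <= 0:
        raise ValueError("delta_tsc must be positive")
    return 100.0 * delta_mperf / delta_tsc
```

For example, if TSC advanced by 3,000,000,000 ticks and MPERF by 750,000,000 ticks over the interval, the CPU spent about 25% of the time in C0.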

AvgMHz

Turbostat's "AvgMHz" metric represents the average CPU frequency over the measurement interval. It is calculated as:

AvgMHz = ΔAPERF / (Interval_in_seconds × 1,000,000)

Where:

  • ΔAPERF: Change in APERF counter over the measurement interval
  • Interval_in_seconds: The measurement interval in seconds (default is 1 second)
  • 1,000,000: Conversion factor from Hz to MHz

The AvgMHz metric is affected by C-state residency. For example, if a CPU remains in a C-state for the entire measurement interval, AvgMHz will be 0.

Similar to the Busy% calculation, Intel SDM does not architecturally guarantee any specific meaning for APERF beyond its ratio with MPERF. This calculation is based on practical experience and empirical observation rather than official specification.
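A minimal Python sketch of the AvgMHz arithmetic is shown below. The function name is hypothetical and the code only illustrates the formula, not turbostat's implementation.

```python
def avg_mhz(delta_aperf: int, interval_seconds: float) -> float:
    """AvgMHz = delta APERF / (interval in seconds * 1,000,000)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return delta_aperf / (interval_seconds * 1_000_000)
```

For example, an APERF delta of 1,200,000,000 over a 1-second interval yields 1200 MHz, and a delta of 0 (the CPU never left its C-state) yields 0 MHz.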

BzyMHz

Turbostat's "BzyMHz" metric represents the average CPU frequency while the processor is active (in C0 state). It is calculated as:

BzyMHz = BaseFreqMHz × (ΔAPERF / ΔMPERF)

Where:

  • BaseFreqMHz: Base frequency in MHz
  • ΔAPERF: Change in APERF counter over the measurement interval
  • ΔMPERF: Change in MPERF counter over the measurement interval

BaseFreqMHz is obtained either via CPUID.16H or calculated as 'ΔTSC / (Interval_in_seconds × 1,000,000)'. The latter assumes that TSC frequency approximates the base frequency.

The calculation works as follows:

  • ΔAPERF / ΔMPERF provides the ratio of actual frequency to base frequency during the C0 state.
  • Multiplying this ratio by the base frequency yields the average CPU frequency while in the C0 state.
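The same calculation as a minimal Python sketch. The function name is hypothetical, and returning 0 when MPERF did not advance (the CPU was idle for the whole interval) is a choice made for this example rather than documented turbostat behavior.

```python
def bzy_mhz(base_freq_mhz: float, delta_aperf: int, delta_mperf: int) -> float:
    """BzyMHz = BaseFreqMHz * (delta APERF / delta MPERF)."""
    if delta_mperf == 0:
        # No C0 time in the interval, so there is no "busy" frequency to report.
        return 0.0
    return base_freq_mhz * delta_aperf / delta_mperf
```

For example, with a 3000 MHz base frequency, an APERF delta of 1,200,000,000 and an MPERF delta of 750,000,000, the ratio is 1.6 and the result is 4800 MHz. If the CPU was 25% busy during a 1-second interval, this is consistent with an AvgMHz of 1200 MHz: 4800 MHz × 0.25.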

Base vs Guaranteed Frequency

Intel CPUs report their base frequency via CPUID.16H or MSR_PLATFORM_INFO (0xCE), bits 8-15. The latter provides the "maximum non-turbo ratio", which multiplied by the bus clock frequency yields the base frequency.
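The MSR_PLATFORM_INFO decoding can be sketched in Python as below. The function name is hypothetical, the raw MSR value is assumed to have been read elsewhere, and the 100 MHz default for the bus clock is a nominal value rather than a measured one.

```python
def base_mhz_from_platform_info(platform_info: int, bclk_mhz: float = 100.0) -> float:
    """Base frequency in MHz from a raw MSR_PLATFORM_INFO (0xCE) value.

    Bits 8-15 hold the maximum non-turbo ratio; multiplying it by the
    bus clock frequency gives the base frequency.
    """
    max_non_turbo_ratio = (platform_info >> 8) & 0xFF
    return max_non_turbo_ratio * bclk_mhz
```

For example, a raw value whose bits 8-15 contain 0x1E (30) corresponds to a base frequency of 3000 MHz with a nominal 100 MHz bus clock.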

Traditionally, "base frequency" represented the guaranteed, sustainable CPU frequency that could run indefinitely under the processor's TDP (Thermal Design Power) rating. Any frequencies above base were considered "turbo" - opportunistic boosts that were not guaranteed and depended on thermal and power headroom.

Server platforms typically implement robust cooling solutions that enable CPUs to sustain their base frequency continuously, maintaining the traditional relationship between base and guaranteed frequency for Intel server processors.

On client platforms, the same CPU model may be deployed across vastly different thermal envelopes - from ultra-low-power designs in thin laptops to high-performance configurations in gaming systems. The actual guaranteed frequency varies based on the thermal solution, even for identical CPU models.

Creating hundreds of distinct CPU SKUs to match every possible thermal design would be impractical. Instead, Intel ships a limited number of CPU SKUs with a base frequency (as reported by CPUID.16H or MSR_PLATFORM_INFO) that may represent the guaranteed frequency only in the high-end thermal designs. Systems with more constrained cooling have lower guaranteed frequencies while having the same base frequency.

Consequently, on modern Intel client platforms, the base frequency no longer reliably indicates the guaranteed sustainable CPU frequency for a given system.