content/learning-paths/cross-platform/topdown-compare/1-top-down.md

weight: 3
layout: learningpathall
---
## What are the differences between Arm Neoverse and Intel x86 PMU counters?
This is a common question from both software developers and performance engineers working with multiple architectures.
Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitectures, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.
Although counter names and formulas differ, both Intel x86 and Arm Neoverse classify performance bottlenecks into the same four top-level categories:
- Retiring
- Bad Speculation
- Frontend Bound
- Backend Bound
The first step is to focus on the dominant top-level bucket. Then, on Intel x86 you descend through the formal sub-levels. On Arm, you derive similar insights using architecture-specific event groups and formulas that approximate those sub-divisions.
This Learning Path compares Intel x86 Top-down Microarchitecture Analysis (a formal multi-level hierarchy) with Arm Neoverse top-down guidance (the same four level-1 buckets, but fewer standardized sub-levels). You will learn how the approaches align conceptually while noting differences in PMU event semantics and machine width.
## Introduction to top-down performance analysis
The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline utilization. Instead of trying to interpret dozens of metrics, you can systematically identify bottlenecks by attributing CPU pipeline activity to one of the four categories.
A slot represents one potential opportunity for a processor core to issue and execute a micro-operation (µop) during a single clock cycle.
The total number of slots is machine width × cycles, and each slot is either used productively or wasted through speculation or stalls.
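To make the arithmetic concrete, here is a minimal sketch of slot accounting in Python; the machine width and all counts are invented for illustration:

```python
# Slot accounting sketch: total slots = machine width x unhalted cycles.
machine_width = 4         # issue slots per cycle (illustrative; varies by core)
cycles = 1_000_000        # measured unhalted cycles (invented)
total_slots = machine_width * cycles

# Every slot is attributed to exactly one of the four top-level categories,
# so the four counts always add back up to the total.
retiring = 2_200_000          # slots that retired useful uops (invented)
bad_speculation = 300_000     # slots spent on work that never retired (invented)
frontend_bound = 600_000      # slots with no uop supplied (invented)
backend_bound = total_slots - (retiring + bad_speculation + frontend_bound)

shares = {name: count / total_slots for name, count in [
    ("Retiring", retiring), ("Bad Speculation", bad_speculation),
    ("Frontend Bound", frontend_bound), ("Backend Bound", backend_bound)]}
```

Because the categories partition the slots, the shares always sum to 1, which is a useful sanity check when assembling metrics by hand.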
**Retiring** represents slots that retire useful instructions (µops).
**Bad Speculation** accounts for slots consumed by mispredicted branches, pipeline flushes, or other speculative work that does not retire.
On Intel x86 this includes machine clears; on Arm Neoverse it is modeled through misprediction and refill events.

**Frontend Bound** identifies slots lost because the core cannot supply enough decoded micro-ops. On Intel this subdivides into frontend latency (instruction cache, ITLB, branch predictor) versus frontend bandwidth (µop supply limits). On Arm Neoverse you approximate similar causes with instruction fetch, branch, and L1 I-cache events.

**Backend Bound** covers slots where issued micro-ops wait on data or execution resources. Intel x86 subdivides this into memory bound (cache or memory hierarchy latency and bandwidth) versus core bound (execution port pressure, scheduler or reorder buffer limits). Arm Neoverse guidance uses a similar memory-versus-core breakdown, with different PMU event groupings that separate long-latency data access from execution resource contention.
The methodology allows you to focus on the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time.
The next sections compare the Intel x86 methodology with the Arm top-down methodology.

{{% notice Notes %}}
This Learning Path uses the Arm Neoverse V2 when specific details are required; some details differ on other Neoverse N and Neoverse V processors.
AMD also has an equivalent top-down methodology, similar to Intel's, but it uses different counters and calculations.
{{% /notice %}}

---
title: "Understand Intel x86 multilevel hierarchical top-down analysis"
weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Configure slot-based accounting with Intel x86 PMU counters
Intel uses a slot-based accounting model, where each CPU cycle provides multiple issue slots.

A slot is a hardware resource that represents one opportunity for a micro-operation (μop) to issue for execution during a single clock cycle.
Each cycle, the core exposes a fixed number of these issue opportunities, known as the machine width in Intel's Top-Down Microarchitecture Analysis Methodology (TMAM), also referred to simply as TMA.
The total number of available slots is defined as:

`SLOTS = machine_width * CPU_CLK_UNHALTED.THREAD`

On earlier 4-wide cores this reduces to the familiar `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD`.

The machine width corresponds to the maximum number of μops that a core can issue to execution pipelines per cycle.
- Intel cores such as Skylake and Cascade Lake are 4-wide.
- Newer server and client cores such as Sapphire Rapids, Emerald Rapids, Granite Rapids, and Meteor Lake P-cores are 6-wide.
- Future generations may widen further, but the slot-based framework remains the same.
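Machine width matters when you normalize: the same cycle count yields different slot totals on a 4-wide core than on a 6-wide core, so raw counts only become comparable percentages after dividing by the right width. A small illustration (counter values invented):

```python
def total_slots(machine_width: int, unhalted_cycles: int) -> int:
    """Slot capacity over a measurement window: width x cycles."""
    return machine_width * unhalted_cycles

cycles = 1_000_000
slots_4wide = total_slots(4, cycles)   # e.g. a Skylake-class core
slots_6wide = total_slots(6, cycles)   # e.g. a Sapphire Rapids P-core

# The same retired-uop count fills a larger share of a narrower machine.
retired_uops = 3_000_000
retiring_4wide = retired_uops / slots_4wide   # 75% of a 4-wide machine
retiring_6wide = retired_uops / slots_6wide   # 50% of a 6-wide machine
```

This is why applying a 4-wide formula to a 6-wide core silently overstates every category.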

Tools such as `perf stat --topdown` automatically apply the correct machine width for the detected CPU.
Intel’s methodology uses a multi-level hierarchy that typically extends to three or four levels of detail. Each level provides progressively finer analysis, allowing you to drill down from high-level categories to specific hardware events.
At Level 1, all pipeline slots are attributed to one of four categories, giving a high-level view of how the CPU’s issue capacity is being used:

- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS`
- Bad Speculation = `(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + N * RECOVERY_CYCLES) / SLOTS`
- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS`
- Backend Bound = the remainder after the other three categories

Most workflows compute Backend Bound as the residual after Retiring, Frontend Bound, and Bad Speculation are accounted for.
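As a sketch, the Level 1 fractions can be computed from raw counter values like this. The counter values are invented, the event names follow the classic 4-wide TMA formulas (for example `UOPS_RETIRED.RETIRE_SLOTS` and `UOPS_ISSUED.ANY`), and Backend Bound is taken as the residual:

```python
def level1_breakdown(width, cycles, uops_issued, uops_retired_slots,
                     recovery_cycles, fetch_bubbles):
    """Level 1 top-down fractions (sketch; exact events vary by generation)."""
    slots = width * cycles
    retiring = uops_retired_slots / slots
    # Issued-but-not-retired uops plus recovery slots count as Bad Speculation.
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    # Slots where the frontend delivered no uop (e.g. IDQ_UOPS_NOT_DELIVERED).
    frontend_bound = fetch_bubbles / slots
    # Backend Bound is the residual once the other three are accounted for.
    backend_bound = 1.0 - (retiring + bad_speculation + frontend_bound)
    return {"Retiring": retiring, "Bad Speculation": bad_speculation,
            "Frontend Bound": frontend_bound, "Backend Bound": backend_bound}

metrics = level1_breakdown(width=4, cycles=1_000_000,
                           uops_issued=2_600_000, uops_retired_slots=2_400_000,
                           recovery_cycles=25_000, fetch_bubbles=400_000)
dominant = max(metrics, key=metrics.get)
```

With these invented inputs the four fractions sum to 1 and Retiring dominates, so no further drill-down would be needed.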
### Level 2: Analyze broader bottleneck causes
Once the dominant Level 1 category is identified, Level 2 separates each category into groups:
| Category | Level 2 Sub-Categories | Purpose |
|-----------|------------------------|----------|
| Frontend Bound | Frontend Latency vs Frontend Bandwidth | Distinguish instruction-fetch delays from decode or μop-cache throughput limits. |
| Backend Bound | Memory Bound vs Core Bound | Separate stalls caused by memory-hierarchy latency and bandwidth from those caused by execution-unit contention or scheduler pressure. |
| Bad Speculation | Branch Mispredict vs Machine Clears | Identify speculation waste due to control-flow mispredictions or pipeline clears. |
| Retiring | Base vs Microcode Sequencer | Show the proportion of useful work from regular instructions versus microcoded sequences. |

| Counter | Description |
|---------|-------------|
|`MACHINE_CLEARS.COUNT`| Pipeline clears due to faults or ordering |
|`CYCLE_ACTIVITY.STALLS_TOTAL`| Total stall cycles |
|`CYCLE_ACTIVITY.STALLS_MEM_ANY`| Stalls from memory hierarchy misses |
|`CYCLE_ACTIVITY.STALLS_L1D_MISS`| Stalls due to L1 data-cache misses |
|`CYCLE_ACTIVITY.STALLS_L2_MISS`| Stalls waiting on L2 cache misses |
|`CYCLE_ACTIVITY.STALLS_L3_MISS`| Stalls waiting on last-level cache misses |
|`MEM_LOAD_RETIRED.L1_HIT / L2_HIT / L3_HIT`| Track where loads are satisfied in the cache hierarchy |
|`MEM_LOAD_RETIRED.L3_MISS`| Loads missing the LLC and going to memory |
|`MEM_LOAD_RETIRED.DRAM_HIT`| Loads serviced by DRAM |
|`OFFCORE_RESPONSE.*`| Detailed classification of off-core responses (L3 vs DRAM, local vs remote socket) |
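As one example of using these counters, the nested `CYCLE_ACTIVITY.*` stall events can be differenced to estimate which cache level dominates memory stalls: a cycle stalled on an L2 miss is also counted as an L1D-miss stall, so subtracting adjacent levels isolates each one. This only approximates the TMA memory-bound breakdown, and all counts below are invented:

```python
# Invented sample counts for the nested CYCLE_ACTIVITY.* stall events.
stalls_l1d_miss = 180_000   # CYCLE_ACTIVITY.STALLS_L1D_MISS
stalls_l2_miss = 120_000    # CYCLE_ACTIVITY.STALLS_L2_MISS
stalls_l3_miss = 90_000     # CYCLE_ACTIVITY.STALLS_L3_MISS
stalls_mem_any = 400_000    # CYCLE_ACTIVITY.STALLS_MEM_ANY

# Difference adjacent levels to attribute stall cycles to a single level.
l1_bound = stalls_mem_any - stalls_l1d_miss   # resolved by L1 after a short wait
l2_bound = stalls_l1d_miss - stalls_l2_miss   # waiting on L2
l3_bound = stalls_l2_miss - stalls_l3_miss    # waiting on L3
dram_bound = stalls_l3_miss                   # missed the LLC, went to memory
```

The four pieces add back up to `STALLS_MEM_ANY`, which is a quick consistency check before acting on the numbers.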
Some events (for example, `CYCLE_ACTIVITY.*` and `MEM_LOAD_RETIRED.*`) vary across microarchitectures, so you should confirm their availability on your CPU.
### Practical guidance
Here are some practical steps to keep in mind:
- Normalize all metrics to total slots: `machine_width × CPU_CLK_UNHALTED.THREAD`.
- Start at Level 1 to identify the dominant bottleneck.
- Drill down progressively through Levels 2 and 3 to isolate the root cause.
- Use raw events (Level 4) for detailed validation or hardware-tuning analysis.
- Check event availability before configuring counters on different CPU generations.
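The steps above can be sketched as a tiny helper that takes Level 1 fractions, picks the dominant bucket, and names the Level 2 split to examine next (the mapping restates the Level 2 table earlier in this section; the fractions are invented):

```python
# Level 1 category -> Level 2 split to examine next (from the Level 2 table).
NEXT_SPLIT = {
    "Frontend Bound": "Frontend Latency vs Frontend Bandwidth",
    "Backend Bound": "Memory Bound vs Core Bound",
    "Bad Speculation": "Branch Mispredict vs Machine Clears",
    "Retiring": "Base vs Microcode Sequencer",
}

def drill_down(level1: dict) -> str:
    """Name the dominant Level 1 bucket and the Level 2 split to check next."""
    dominant = max(level1, key=level1.get)
    return f"{dominant} -> {NEXT_SPLIT[dominant]}"

hint = drill_down({"Retiring": 0.35, "Bad Speculation": 0.05,
                   "Frontend Bound": 0.15, "Backend Bound": 0.45})
```

With the invented fractions above, the helper points at the Backend Bound split, so the memory-versus-core events would be programmed next.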
## Summary
Intel's Top-Down methodology provides a structured, slot-based framework for understanding pipeline efficiency. Each slot represents a potential μop issue opportunity.
By attributing every slot to one of the four categories you can measure how effectively a core executes useful work versus wasting cycles on stalls or speculation.