
Commit 89f9abb

Merge pull request #2495 from jasonrandrews/review
Update top-down performance Learning Path
2 parents eb16b34 + 4eb2d94 commit 89f9abb

File tree: 6 files changed (+420 −281 lines)

content/learning-paths/cross-platform/topdown-compare/1-top-down.md

Lines changed: 24 additions & 9 deletions
layout: learningpathall
---

## What are the differences between Arm Neoverse and Intel x86 PMU counters?
This is a common question from both software developers and performance engineers working with multiple architectures.

Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitectures, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.
Although counter names and formulas differ, both Intel x86 and Arm Neoverse classify performance bottlenecks into the same four top-level categories:

- Retiring
- Bad Speculation
- Frontend Bound
- Backend Bound

The first step is to focus on the dominant top-level bucket. On Intel x86, you then descend through the formal sub-levels; on Arm, you derive similar insights using architecture-specific event groups and formulas that approximate those sub-divisions.

This Learning Path compares Intel x86 Top-down Microarchitecture Analysis (a formal multi-level hierarchy) with Arm Neoverse top-down guidance (the same four level-1 buckets, but fewer standardized sub-levels). You will learn how the approaches align conceptually while noting differences in PMU event semantics and machine width.

## Introduction to top-down performance analysis

The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline utilization. Instead of trying to interpret dozens of metrics, you can systematically identify bottlenecks by attributing CPU pipeline activity to one of the four categories.

A slot represents one potential opportunity for a processor core to issue and execute a micro-operation (µop) during a single clock cycle. The total number of slots is machine width × cycles, and each slot is either used productively or wasted through speculation or stalls.
**Retiring** represents slots that retire useful instructions (µops).

**Bad Speculation** accounts for slots consumed by mispredicted branches, pipeline flushes, or other speculative work that does not retire. On Intel x86 this includes machine clears; on Arm Neoverse it is modeled through misprediction and refill events.

**Frontend Bound** identifies slots lost because the core cannot supply enough decoded micro-ops. On Intel this subdivides into frontend latency (instruction cache, ITLB, branch predictor) versus frontend bandwidth (µop supply limits). On Arm Neoverse, you approximate similar causes with instruction fetch, branch, and L1 I-cache events.

**Backend Bound** covers slots where issued micro-ops wait on data or execution resources. Intel x86 subdivides this into memory bound (cache or memory hierarchy latency and bandwidth) versus core bound (execution port pressure, scheduler or reorder buffer limits). Arm Neoverse guidance uses a similar memory-versus-core breakdown with different PMU event groupings, separating long-latency data access from execution resource contention.

The methodology allows you to focus on the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time.
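To make slot accounting concrete, here is a minimal Python sketch using invented counts; the category totals are placeholders, not measurements from any CPU:

```python
# Hypothetical top-down level-1 attribution (illustrative numbers only).
# total_slots = machine_width * cycles; every slot falls into exactly one category.

machine_width = 4              # issue slots per cycle (assumes a 4-wide core)
cycles = 1_000_000             # unhalted core cycles (made-up value)
total_slots = machine_width * cycles

# Assumed slot counts for each category; they must sum to total_slots.
slots = {
    "Retiring": 2_200_000,
    "Bad Speculation": 300_000,
    "Frontend Bound": 500_000,
    "Backend Bound": 1_000_000,
}
assert sum(slots.values()) == total_slots

# Express each category as a fraction of the slot budget and pick the largest.
fractions = {k: v / total_slots for k, v in slots.items()}
dominant = max(fractions, key=fractions.get)
print(dominant, f"{fractions[dominant]:.0%}")   # Retiring 55%
```

In this invented example more than half the slots retire useful work, so the next step would be to confirm the remaining Backend Bound share before optimizing.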
The next sections compare the Intel x86 methodology with the Arm top-down methodology.

{{% notice Notes %}}
This Learning Path uses the Arm Neoverse V2 when specific details are required; some details differ on other Neoverse N and Neoverse V processors.

AMD also has an equivalent top-down methodology, similar to Intel's but using different counters and calculations.
{{% /notice %}}
Lines changed: 80 additions & 43 deletions
---
title: "Understand Intel x86 multilevel hierarchical top-down analysis"
weight: 4

### FIXED, DO NOT MODIFY
## Configure slot-based accounting with Intel x86 PMU counters
Intel uses a slot-based accounting model, where each CPU cycle provides multiple issue slots.

A slot is a hardware resource that represents one opportunity for a micro-operation (μop) to issue for execution during a single clock cycle.

Each cycle, the core exposes a fixed number of these issue opportunities; this number is known as the machine width in Intel's Top-Down Microarchitecture Analysis Methodology (TMAM). You may also see the methodology referred to as TMA.
The total number of available slots is defined as:

`Total_SLOTS = machine_width × CPU_CLK_UNHALTED.THREAD`

The machine width corresponds to the maximum number of μops that a core can issue to execution pipelines per cycle.
- Intel cores such as Skylake and Cascade Lake are 4-wide.
- Newer server and client cores such as Sapphire Rapids, Emerald Rapids, Granite Rapids, and Meteor Lake P-cores are 6-wide.
- Future generations may widen further, but the slot-based framework remains the same.

Tools such as `perf stat --topdown` automatically apply the correct machine width for the detected CPU.
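As a sketch of the slot calculation, the lookup below encodes only the widths from the bullets above; it is illustrative, not a complete or authoritative table, so verify the width for your exact CPU:

```python
# Compute the slot budget for a measurement interval.
# Widths follow the bullet list above; treat them as assumptions, not a full table.
MACHINE_WIDTH = {
    "skylake": 4,
    "cascadelake": 4,
    "sapphirerapids": 6,
    "emeraldrapids": 6,
    "graniterapids": 6,
}

def total_slots(microarch: str, unhalted_cycles: int) -> int:
    """Total_SLOTS = machine_width * CPU_CLK_UNHALTED.THREAD."""
    return MACHINE_WIDTH[microarch] * unhalted_cycles

print(total_slots("skylake", 1_000_000))         # 4000000
print(total_slots("sapphirerapids", 1_000_000))  # 6000000
```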
Intel's methodology uses a multi-level hierarchy that typically extends to three or four levels of detail. Each level provides progressively finer analysis, allowing you to drill down from high-level categories to specific hardware events.

### Level 1: Identify top-level performance categories
At Level 1, all pipeline slots are attributed to one of four categories, giving a high-level view of how the CPU's issue capacity is being used:

- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS`
- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS`
- Bad Speculation = derived from speculative flush behavior (branch mispredictions and machine clears) or computed residually
- Backend Bound = `1 − (Retiring + Frontend Bound + Bad Speculation)`

Most workflows compute Backend Bound as the residual after Retiring, Frontend Bound, and Bad Speculation are accounted for.
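A minimal sketch of the Level 1 math, assuming a 4-wide core and placeholder counter values (in practice you would collect the counts with `perf stat`, and Bad Speculation would itself be derived from misprediction and machine-clear events rather than assumed):

```python
# Level 1 metrics from raw Intel PMU counts (placeholder values, not measurements).
machine_width = 4
counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_000_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 600_000,
}
bad_speculation = 0.05  # assumed here; normally derived from flush behavior

slots = machine_width * counts["CPU_CLK_UNHALTED.THREAD"]
retiring = counts["UOPS_RETIRED.RETIRE_SLOTS"] / slots
frontend_bound = counts["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots

# Backend Bound computed residually, as most workflows do.
backend_bound = 1 - (retiring + frontend_bound + bad_speculation)

print(f"Retiring={retiring:.2f} Frontend={frontend_bound:.2f} "
      f"BadSpec={bad_speculation:.2f} Backend={backend_bound:.2f}")
```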
### Level 2: Analyze broader bottleneck causes

Once the dominant Level 1 category is identified, Level 2 separates each category into groups:
| Category | Level 2 sub-categories | Purpose |
|-----------|------------------------|----------|
| Frontend Bound | Frontend Latency vs Frontend Bandwidth | Distinguish instruction-fetch delays from decode or μop cache throughput limits. |
| Backend Bound | Memory Bound vs Core Bound | Separate stalls caused by memory hierarchy latency or bandwidth from those caused by execution-unit contention or scheduler pressure. |
| Bad Speculation | Branch Mispredicts vs Machine Clears | Identify speculation waste due to control-flow mispredictions or pipeline clears. |
| Retiring | Base vs Microcode Sequencer | Show the proportion of useful work from regular instructions versus microcoded sequences. |

### Level 3: Target specific microarchitecture bottlenecks

Level 3 provides fine-grained attribution that pinpoints precise hardware limitations.

Examples include:

- Memory Bound covers L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound
- Core Bound covers execution-port pressure, divider utilization, and scheduler or ROB occupancy
- Frontend Latency covers instruction-cache misses, ITLB walks, and branch-prediction misses
- Frontend Bandwidth covers decode throughput or μop cache saturation

At this level, you can determine whether workloads are limited by memory latency, cache hierarchy bandwidth, or execution-resource utilization.
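To illustrate the Memory Bound drill-down, here is a small sketch that ranks invented stall-cycle numbers to show which sub-category dominates; the values are made up for illustration, not derived from real counters:

```python
# Rank hypothetical Memory Bound contributors by stall cycles (invented numbers).
stalls = {
    "L1 Bound": 120_000,
    "L2 Bound": 80_000,
    "L3 Bound": 150_000,
    "DRAM Bound": 400_000,
    "Store Bound": 30_000,
}

# Sort sub-categories from largest to smallest contribution.
for name, cycles in sorted(stalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {cycles:>8}")
# In this made-up case DRAM Bound dominates, pointing at memory latency or
# bandwidth rather than cache misses closer to the core.
```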

### Level 4: Access specific PMU counter events

Level 4 exposes the Performance Monitoring Unit (PMU) events that implement the hierarchy.

Here you analyze raw event counts to understand detailed pipeline behavior. Event names and availability vary by microarchitecture, but you can verify them with `perf list`.
| Event Name | Purpose |
| :---------------------------------------------- | :----------------------------------------------------------------------------------- |
| `UOPS_RETIRED.RETIRE_SLOTS` | Counts retired μops |
| `UOPS_ISSUED.ANY` | Counts all issued μops (used in speculation analysis) |
| `IDQ_UOPS_NOT_DELIVERED.CORE` | Counts μops not delivered from the frontend |
| `CPU_CLK_UNHALTED.THREAD` | Core clock cycles (baseline for normalization) |
| `BR_MISP_RETIRED.ALL_BRANCHES` | Branch mispredictions |
| `MACHINE_CLEARS.COUNT` | Pipeline clears due to faults or memory ordering |
| `CYCLE_ACTIVITY.STALLS_TOTAL` | Total stall cycles |
| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | Stalls from memory hierarchy misses |
| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | Stalls due to L1 data-cache misses |
| `CYCLE_ACTIVITY.STALLS_L2_MISS` | Stalls waiting on L2 cache misses |
| `CYCLE_ACTIVITY.STALLS_L3_MISS` | Stalls waiting on last-level cache misses |
| `MEM_LOAD_RETIRED.L1_HIT` / `L2_HIT` / `L3_HIT` | Track where loads are satisfied in the cache hierarchy |
| `MEM_LOAD_RETIRED.L3_MISS` | Loads missing the LLC and going to memory |
| `MEM_LOAD_RETIRED.DRAM_HIT` | Loads serviced by DRAM |
| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs DRAM, local vs remote socket) |

Some events (for example, `CYCLE_ACTIVITY.*` and `MEM_LOAD_RETIRED.*`) vary across microarchitectures, so you should confirm them on your CPU.
### Practical guidance

Here are some practical steps to keep in mind:

- Normalize all metrics to total slots: `machine_width × CPU_CLK_UNHALTED.THREAD`.
- Start at Level 1 to identify the dominant bottleneck.
- Drill down progressively through Levels 2 and 3 to isolate the root cause.
- Use raw events (Level 4) for detailed validation or hardware-tuning analysis.
- Check event availability before configuring counters on different CPU generations.
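The steps above can be sketched as a tiny helper that maps the dominant Level 1 category to the Level 2 split to examine next; the mapping mirrors the Level 2 table, and the input fractions are illustrative:

```python
# Suggest the next drill-down step from Level 1 fractions (illustrative sketch).
LEVEL2_SPLIT = {
    "Retiring": "Base vs Microcode Sequencer",
    "Bad Speculation": "Branch Mispredicts vs Machine Clears",
    "Frontend Bound": "Frontend Latency vs Frontend Bandwidth",
    "Backend Bound": "Memory Bound vs Core Bound",
}

def next_step(level1: dict) -> str:
    """Name the Level 2 split for the dominant Level 1 category."""
    dominant = max(level1, key=level1.get)
    return f"Drill into {dominant}: {LEVEL2_SPLIT[dominant]}"

print(next_step({"Retiring": 0.35, "Bad Speculation": 0.05,
                 "Frontend Bound": 0.15, "Backend Bound": 0.45}))
# Drill into Backend Bound: Memory Bound vs Core Bound
```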
## Summary

Intel's Top-Down methodology provides a structured, slot-based framework for understanding pipeline efficiency. Each slot represents a potential μop issue opportunity.

By attributing every slot to one of the four categories, you can measure how effectively a core executes useful work versus wasting cycles on stalls or speculation.
