Commit 2ec09ef
Merge pull request #2391 from madeline-underwood/review
topdown_JA to review
2 parents 7d1e63a + 1d4e4f3

File tree

6 files changed: +220 −212 lines changed
Lines changed: 14 additions & 178 deletions
@@ -1,197 +1,33 @@
 ---
-title: Top-down performance analysis
+title: "Analyze Intel x86 and Arm Neoverse top-down performance methodologies"
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## What are the differences between Arm and x86 PMU counters?
+## What are the differences between Arm and Intel x86 PMU counters?
 
-This is a common question from software developers and performance engineers.
+This is a common question from both software developers and performance engineers working across architectures.
 
-Both Arm and x86 CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitectures, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.
+Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitectures, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.
 
-While the specific counter names and formulas differ between architectures, both have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four buckets: Retiring, Bad Speculation, Frontend Bound, and Backend Bound.
+While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas:
 
-This Learning Path provides a comparison of how Arm and x86 processors implement top-down
-analysis, highlighting the similarities in approach while explaining the architectural differences in counter events and formulas.
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints.
+
+This Learning Path compares the four-level hierarchical top-down analysis used on x86 processors with Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
 
 ## Introduction to top-down performance analysis
 
-Top-down methodology makes performance analysis easier by shifting focus from individual performance
-counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
+The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
 
-- Retiring: pipeline slots that successfully complete useful work
-- Bad Speculation: slots wasted on mispredicted branches
-- Frontend Bound: slots stalled due to instruction fetch/decode limitations
-- Backend Bound: slots stalled due to execution resource constraints
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability.
 
 The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category and avoid the complexity of analyzing all possible performance issues at the same time.
-The next sections compare the Intel x86 methodology with the Arm top-down methodology. AMD also has an equivalent top-down methodology which is similar to Intel's, but uses different counters and calculations.
-
-## Intel x86 top-down methodology
-
-Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process operations. More slots means more work can be done. The number of slots depends on the design, but current processor designs have 4, 6, or 8 slots.
-
-### Hierarchical Structure
-
-Intel uses a multi-level hierarchy that typically extends to 4 levels of detail.
-
-**Level 1 (Top-Level):**
-
-At Level 1, all pipeline slots are attributed to one of four categories, providing a high-level view of whether the CPU is doing useful work or stalling.
-
-- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS`
-- Bad Speculation = `(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + N * RECOVERY_CYCLES) / SLOTS`
-- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS`
-- Backend Bound = `1 - (Frontend + Bad Spec + Retiring)`
-
-Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores.
-
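To make the Level 1 arithmetic concrete, here is a minimal Python sketch of the formulas above. The event counts are invented for illustration (not real measurements), and the recovery-cycle multiplier `N` is assumed to equal the 4-slot pipeline width:

```python
def intel_topdown_level1(uops_retired, uops_issued, idq_uops_not_delivered,
                         recovery_cycles, cycles, width=4):
    """Compute the four Intel Level 1 top-down fractions from raw counts."""
    slots = width * cycles  # SLOTS = 4 * CPU_CLK_UNHALTED.THREAD
    retiring = uops_retired / slots
    # N * RECOVERY_CYCLES assumed to be width * recovery_cycles here
    bad_speculation = (uops_issued - uops_retired
                       + width * recovery_cycles) / slots
    frontend_bound = idq_uops_not_delivered / slots
    # Backend Bound is the remainder of the slot budget
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {"retiring": retiring, "bad_speculation": bad_speculation,
            "frontend_bound": frontend_bound, "backend_bound": backend_bound}

# Synthetic example: 1e9 unhalted cycles -> 4e9 slots
metrics = intel_topdown_level1(
    uops_retired=2.0e9, uops_issued=2.2e9,
    idq_uops_not_delivered=0.6e9, recovery_cycles=0.05e9, cycles=1.0e9)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because Backend Bound is derived as the remainder, the four fractions always sum to 1.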
-**Level 2 breakdown:**
-
-Level 2 drills into each of these categories to identify broader causes, such as distinguishing between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend.
-
-- Frontend Bound covers frontend latency vs. frontend bandwidth
-- Backend Bound covers memory bound vs. core bound
-- Bad Speculation covers branch mispredicts vs. machine clears
-- Retiring covers base vs. microcode sequencer
-
-**Level 3 breakdown:**
-
-Level 3 provides fine-grained attribution, pinpointing specific bottlenecks like DRAM latency, cache misses, or port contention, which makes it possible to identify the exact root cause and apply targeted optimizations.
-
-- Memory Bound includes L1 Bound, L2 Bound, L3 Bound, DRAM Bound, Store Bound
-- Core Bound includes Divider, Ports Utilization
-- And many more specific categories
-
-**Level 4 breakdown:**
-
-Level 4 provides the specific microarchitecture events that cause the inefficiencies.
-
-### Key Performance Events
-
-Intel processors expose hundreds of performance events, but top-down analysis relies on a core set:
-
-| Event Name | Purpose |
-| :-- | :-- |
-| `UOPS_RETIRED.RETIRE_SLOTS` | Count retired micro-operations (Retiring) |
-| `UOPS_ISSUED.ANY` | Count issued micro-operations (helps quantify Bad Speculation) |
-| `IDQ_UOPS_NOT_DELIVERED.CORE` | Frontend delivery failures (Frontend Bound) |
-| `CPU_CLK_UNHALTED.THREAD` | Core clock cycles (baseline for normalization) |
-| `BR_MISP_RETIRED.ALL_BRANCHES` | Branch mispredictions (Bad Speculation detail) |
-| `MACHINE_CLEARS.COUNT` | Pipeline clears due to memory ordering or faults (Bad Speculation detail) |
-| `CYCLE_ACTIVITY.STALLS_TOTAL` | Total stall cycles (baseline for backend breakdown) |
-| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | Aggregate stalls from memory hierarchy misses (Backend → Memory Bound) |
-| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | Stalls due to L1 data cache misses |
-| `CYCLE_ACTIVITY.STALLS_L2_MISS` | Stalls waiting on L2 cache misses |
-| `CYCLE_ACTIVITY.STALLS_L3_MISS` | Stalls waiting on last-level cache misses |
-| `MEM_LOAD_RETIRED.L1_HIT` / `L2_HIT` / `L3_HIT` | Track where loads are satisfied in the cache hierarchy |
-| `MEM_LOAD_RETIRED.L3_MISS` | Loads missing the LLC and going to memory |
-| `MEM_LOAD_RETIRED.DRAM_HIT` | Loads serviced by DRAM (DRAM Bound detail) |
-| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) |
-
-
-Using the metrics at these levels, you can find out which of the four top-level categories is causing bottlenecks.
-
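The hierarchical drill-down itself can be sketched as a small lookup: find the dominant Level 1 category, then consult the Level 2 split listed above. The percentages here are made-up example values:

```python
# Level 2 split for each Level 1 category, as listed above
LEVEL2_SPLITS = {
    "frontend_bound": ("frontend latency", "frontend bandwidth"),
    "backend_bound": ("memory bound", "core bound"),
    "bad_speculation": ("branch mispredicts", "machine clears"),
    "retiring": ("base", "microcode sequencer"),
}

def choose_drilldown(level1_fractions):
    """Return the dominant Level 1 category and its Level 2 split."""
    dominant = max(level1_fractions, key=level1_fractions.get)
    return dominant, LEVEL2_SPLITS[dominant]

category, split = choose_drilldown(
    {"retiring": 0.35, "bad_speculation": 0.05,
     "frontend_bound": 0.15, "backend_bound": 0.45})
print(f"Drill into {category}: {split[0]} vs. {split[1]}")
```

This is the key simplification of the methodology: only the dominant branch of the tree needs further investigation.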
-## Arm top-down methodology
-
-Arm developed a similar top-down methodology for Neoverse server cores. Neoverse cores use an 8-slot rename unit for pipeline bandwidth accounting.
-
-### Two-Stage Approach
-
-Unlike Intel's hierarchical model, Arm employs a two-stage methodology:
-
-**Stage 1: Topdown analysis**
-
-- Identifies high-level bottlenecks using the same four categories
-- Uses Arm-specific PMU events and formulas
-- Slot-based accounting similar to Intel's but with Arm event names
-
-**Stage 2: Micro-architecture exploration**
-
-- Resource-specific effectiveness metrics grouped by CPU component
-- Industry-standard metrics like MPKI (Misses Per Kilo Instructions)
-- Detailed breakdown without strict hierarchical drilling
-
-### Stage 1 formulas
-
-Arm computes its top-down metrics from different events, but the concept is similar.
-
-| Metric | Formula | Purpose |
-| :-- | :-- | :-- |
-| Backend bound | `100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * 8))` | Backend resource constraints |
-| Frontend bound | `100 * ((STALL_SLOT_FRONTEND / (CPU_CYCLES * 8)) - (BR_MIS_PRED / (4 * CPU_CYCLES)))` | Frontend delivery issues |
-| Bad speculation | `100 * ((1 - (OP_RETIRED/OP_SPEC)) * (1 - (STALL_SLOT/(CPU_CYCLES * 8))) + (BR_MIS_PRED / (4 * CPU_CYCLES)))` | Misprediction recovery |
-| Retiring | `100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))` | Useful work completed |
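As with the Intel formulas, the Arm Stage 1 table can be turned into a short Python sketch. The event counts below are invented for illustration; the slot budget is `CPU_CYCLES * 8` for the 8-slot rename unit:

```python
def neoverse_topdown_stage1(cpu_cycles, op_retired, op_spec, stall_slot,
                            stall_slot_frontend, stall_slot_backend,
                            br_mis_pred):
    """Compute the four Arm Neoverse Stage 1 metrics (in percent)."""
    slots = cpu_cycles * 8  # 8-slot rename unit
    backend_bound = 100 * (stall_slot_backend / slots)
    frontend_bound = 100 * ((stall_slot_frontend / slots)
                            - (br_mis_pred / (4 * cpu_cycles)))
    bad_speculation = 100 * ((1 - op_retired / op_spec)
                             * (1 - stall_slot / slots)
                             + br_mis_pred / (4 * cpu_cycles))
    retiring = 100 * (op_retired / op_spec) * (1 - stall_slot / slots)
    return {"retiring": retiring, "bad_speculation": bad_speculation,
            "frontend_bound": frontend_bound, "backend_bound": backend_bound}

# Synthetic example: 1e9 cycles -> 8e9 slots, half of them stalled
metrics = neoverse_topdown_stage1(
    cpu_cycles=1.0e9, op_retired=4.5e9, op_spec=5.0e9,
    stall_slot=4.0e9, stall_slot_frontend=1.5e9, stall_slot_backend=2.5e9,
    br_mis_pred=0.02e9)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Note how the branch-misprediction term moves slots from Frontend Bound into Bad Speculation, so the four metrics still total 100%.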
-### Stage 2 resource groups
-
-Instead of hierarchical levels, Arm organizes detailed metrics into effectiveness groups as shown below:
-
-- Branch Effectiveness: misprediction rates, MPKI
-- ITLB/DTLB Effectiveness: translation lookaside buffer efficiency
-- L1I/L1D/L2/LL Cache Effectiveness: cache hit ratios and MPKI
-- Operation Mix: breakdown of instruction types (SIMD, integer, load/store)
-- Cycle Accounting: frontend vs. backend stall percentages
-
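Two of the effectiveness metrics named above, MPKI and a cache miss ratio, are simple to compute once the events are collected. The counts below are invented for illustration:

```python
def mpki(miss_events, inst_retired):
    """Misses Per Kilo Instructions, e.g. L1D_CACHE_REFILL vs INST_RETIRED."""
    return miss_events / (inst_retired / 1000)

def cache_miss_ratio(refills, accesses):
    """Miss ratio, e.g. L1D_CACHE_REFILL / MEM_ACCESS for L1D effectiveness."""
    return refills / accesses

# Invented example: 3M L1D refills over 1B retired instructions
print(mpki(3_000_000, 1_000_000_000))            # 3.0
print(cache_miss_ratio(3_000_000, 150_000_000))  # 0.02
```

Normalizing by instructions (MPKI) rather than accesses makes results comparable across runs and across workloads with different memory-access densities.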
-### Key performance events
-
-Neoverse cores expose approximately 100 hardware events optimized for server workloads, including:
-
-| Event Name | Purpose / Usage |
-| :-- | :-- |
-| `CPU_CYCLES` | Core clock cycles (baseline for normalization). |
-| `OP_SPEC` | Speculatively executed micro-operations (used as slot denominator). |
-| `OP_RETIRED` | Retired micro-operations (used to measure useful work). |
-| `INST_RETIRED` | Instructions retired (architectural measure; used for IPC, MPKI normalization). |
-| `INST_SPEC` | Instructions speculatively executed (needed for operation mix and speculation analysis). |
-| `STALL_SLOT` | Total stall slots (foundation for efficiency metrics). |
-| `STALL_SLOT_FRONTEND` | Stall slots due to frontend resource constraints. |
-| `STALL_SLOT_BACKEND` | Stall slots due to backend resource constraints. |
-| `BR_RETIRED` | Branches retired (baseline for branch misprediction ratio). |
-| `BR_MIS_PRED_RETIRED` | Mispredicted branches retired (branch effectiveness, speculation waste). |
-| `L1I_CACHE_REFILL` | Instruction cache refills (frontend stalls due to I-cache misses). |
-| `ITLB_WALK` | Instruction TLB walks (frontend stalls due to translation). |
-| `L1D_CACHE_REFILL` | Data cache refills (backend stalls due to L1D misses). |
-| `L2D_CACHE_REFILL` | Unified L2 cache refills (backend stalls from L2 misses). |
-| `LL_CACHE_MISS_RD` | Last-level/system cache read misses (backend stalls from LLC/memory). |
-| `DTLB_WALK` | Data TLB walks (backend stalls due to translation). |
-| `MEM_ACCESS` | Total memory accesses (baseline for cache/TLB effectiveness ratios). |
-
-
-## Arm compared to x86
-
-### Conceptual similarities
-
-Both architectures adhere to the same fundamental top-down performance analysis philosophy:
-
-1. Four-category classification: Retiring, Bad Speculation, Frontend Bound, Backend Bound
-2. Slot-based accounting: pipeline utilization measured in issue or rename slots
-3. Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks
-4. Resource attribution: map performance issues to specific CPU micro-architectural components
-
-### Key Differences
-
-| Aspect | x86 Intel | Arm Neoverse |
-| :-- | :-- | :-- |
-| Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups |
-| Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) |
-| Formula Basis | Micro-operation (uop) centric | Operation and cycle centric |
-| Event Naming | Intel-specific mnemonics | Arm-specific mnemonics |
-| Drill-down Strategy | Strict hierarchical descent | Exploration by resource groups |
-
-### Event Mapping Examples
-
-| Performance Question | x86 Intel Events | Arm Neoverse Events |
-| :-- | :-- | :-- |
-| Frontend bound? | `IDQ_UOPS_NOT_DELIVERED.*` | `STALL_SLOT_FRONTEND` |
-| Bad speculation? | `BR_MISP_RETIRED.*` | `BR_MIS_PRED_RETIRED` |
-| Memory bound? | `CYCLE_ACTIVITY.STALLS_L3_MISS` | `L1D_CACHE_REFILL`, `L2D_CACHE_REFILL` |
-| Cache effectiveness? | `MEM_LOAD_RETIRED.L3_MISS_PS` | Cache refill metrics / cache access metrics |
-
-While it doesn't make sense to directly compare PMU counters for the Arm and x86 architectures, it is useful to understand the top-down methodologies for each so you can do effective performance analysis and compare your code running on each architecture.
+The next sections compare the Intel x86 methodology with the Arm top-down methodology.
 
-Continue to the next step to try a code example.
+{{% notice Note %}}
+AMD also has an equivalent top-down methodology which is similar to Intel's, but uses different counters and calculations.
+{{% /notice %}}
