Commit fb53421

Merge pull request #2346 from jasonrandrews/review
Compare Arm and x86 top-down performance analysis

---
title: Top-down performance analysis
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What are the differences between Arm and x86 PMU counters?

This is a common question from software developers and performance engineers.

Both Arm and x86 CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitectures, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.

While the specific counter names and formulas differ between architectures, both have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four buckets: Retiring, Bad Speculation, Frontend Bound, and Backend Bound.

This Learning Path compares how Arm and x86 processors implement top-down analysis, highlighting the similarities in approach while explaining the architectural differences in counter events and formulas.

## Introduction to top-down performance analysis

Top-down methodology simplifies performance analysis by shifting focus from individual performance counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories:

- Retiring: pipeline slots that successfully complete useful work
- Bad Speculation: slots wasted on mispredicted branches
- Frontend Bound: slots stalled due to instruction fetch/decode limitations
- Backend Bound: slots stalled due to execution resource constraints

The methodology uses a hierarchical approach that lets you drill down only into the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time.

The next sections compare the Intel x86 methodology with the Arm top-down methodology. AMD also has an equivalent top-down methodology that is similar to Intel's but uses different counters and calculations.

## Intel x86 top-down methodology

Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process an operation, so more slots means more work can be done per cycle. The number of slots depends on the design, but current processors have 4, 6, or 8 slots.

### Hierarchical structure

Intel uses a multi-level hierarchy that typically extends to four levels of detail.

**Level 1 (Top-Level):**

At Level 1, all pipeline slots are attributed to one of four categories, providing a high-level view of whether the CPU is doing useful work or stalling.

- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS`
- Bad Speculation = `(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + N * RECOVERY_CYCLES) / SLOTS`
- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS`
- Backend Bound = `1 - (Frontend + Bad Spec + Retiring)`

Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores, and `N` is the machine width used to scale recovery cycles (also 4 on those cores).
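
To make the arithmetic concrete, the short Python sketch below turns raw counter values into the four Level 1 fractions using the formulas above. The counter values are placeholders for illustration; in practice you would collect them with a PMU tool such as Linux perf.

```python
# Minimal sketch: compute Intel Level 1 top-down metrics from raw counter values.
# The numbers passed in below are placeholders, not measured data.

def intel_topdown_level1(uops_retired, uops_issued, idq_uops_not_delivered,
                         recovery_cycles, cycles, width=4):
    """Return the four Level 1 categories as fractions of total pipeline slots."""
    slots = width * cycles                     # SLOTS = width * CPU_CLK_UNHALTED.THREAD
    retiring = uops_retired / slots            # UOPS_RETIRED.RETIRE_SLOTS / SLOTS
    bad_speculation = (uops_issued - uops_retired + width * recovery_cycles) / slots
    frontend_bound = idq_uops_not_delivered / slots   # IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS
    backend_bound = 1.0 - (retiring + bad_speculation + frontend_bound)
    return {
        "Retiring": retiring,
        "Bad Speculation": bad_speculation,
        "Frontend Bound": frontend_bound,
        "Backend Bound": backend_bound,
    }

# Placeholder counter values to show the calculation.
metrics = intel_topdown_level1(
    uops_retired=3_200_000_000,
    uops_issued=3_600_000_000,
    idq_uops_not_delivered=900_000_000,
    recovery_cycles=50_000_000,
    cycles=1_500_000_000,
)
for name, value in metrics.items():
    print(f"{name:>16}: {value:6.1%}")
```
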

**Level 2 breakdown:**

Level 2 drills into each of these to identify broader causes, such as distinguishing between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend.

- Frontend Bound covers frontend latency vs. frontend bandwidth
- Backend Bound covers memory bound vs. core bound
- Bad Speculation covers branch mispredicts vs. machine clears
- Retiring covers base vs. microcode sequencer

**Level 3 breakdown:**

Level 3 provides fine-grained attribution, pinpointing specific bottlenecks like DRAM latency, cache misses, or port contention, which makes it possible to identify the exact root cause and apply targeted optimizations.

- Memory Bound includes L1 Bound, L2 Bound, L3 Bound, DRAM Bound, Store Bound
- Core Bound includes Divider, Ports Utilization
- And many more specific categories

**Level 4 breakdown:**

Level 4 provides the specific microarchitecture events that cause the inefficiencies.

### Key performance events

Intel processors expose hundreds of performance events, but top-down analysis relies on a core set:

| Event Name | Purpose |
| :---------------------------------------------- | :----------------------------------------------------------------------------------- |
| `UOPS_RETIRED.RETIRE_SLOTS` | Count retired micro-operations (Retiring) |
| `UOPS_ISSUED.ANY` | Count issued micro-operations (helps quantify Bad Speculation) |
| `IDQ_UOPS_NOT_DELIVERED.CORE` | Frontend delivery failures (Frontend Bound) |
| `CPU_CLK_UNHALTED.THREAD` | Core clock cycles (baseline for normalization) |
| `BR_MISP_RETIRED.ALL_BRANCHES` | Branch mispredictions (Bad Speculation detail) |
| `MACHINE_CLEARS.COUNT` | Pipeline clears due to memory ordering or faults (Bad Speculation detail) |
| `CYCLE_ACTIVITY.STALLS_TOTAL` | Total stall cycles (baseline for backend breakdown) |
| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | Aggregate stalls from memory hierarchy misses (Backend → Memory Bound) |
| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | Stalls due to L1 data cache misses |
| `CYCLE_ACTIVITY.STALLS_L2_MISS` | Stalls waiting on L2 cache misses |
| `CYCLE_ACTIVITY.STALLS_L3_MISS` | Stalls waiting on last-level cache misses |
| `MEM_LOAD_RETIRED.L1_HIT` / `L2_HIT` / `L3_HIT` | Track where loads are satisfied in the cache hierarchy |
| `MEM_LOAD_RETIRED.L3_MISS` | Loads missing LLC and going to memory |
| `MEM_LOAD_RETIRED.DRAM_HIT` | Loads serviced by DRAM (DRAM Bound detail) |
| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) |

Using these levels of metrics, you can determine which of the four top-level categories is causing your bottlenecks.

## Arm top-down methodology

Arm developed a similar top-down methodology for Neoverse server cores. Neoverse cores use an 8-slot rename unit for pipeline bandwidth accounting.

### Two-stage approach

Unlike Intel's hierarchical model, Arm employs a two-stage methodology:

**Stage 1: Topdown analysis**

- Identifies high-level bottlenecks using the same four categories
- Uses Arm-specific PMU events and formulas
- Slot-based accounting similar to Intel but with Arm event names

**Stage 2: Micro-architecture exploration**

- Resource-specific effectiveness metrics grouped by CPU component
- Industry-standard metrics like MPKI (Misses Per Kilo Instructions)
- Detailed breakdown without strict hierarchical drilling

### Stage 1 formulas

Arm uses different top-down metrics based on different events, but the concept is similar. A calculation sketch follows the table below.

| Metric | Formula | Purpose |
| :-- | :-- | :-- |
| Backend bound | `100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * 8))` | Backend resource constraints |
| Frontend bound | `100 * ((STALL_SLOT_FRONTEND / (CPU_CYCLES * 8)) - (BR_MIS_PRED / (4 * CPU_CYCLES)))` | Frontend delivery issues |
| Bad speculation | `100 * (1 - (OP_RETIRED/OP_SPEC)) * (1 - (STALL_SLOT/(CPU_CYCLES * 8))) + (BR_MIS_PRED / (4 * CPU_CYCLES))` | Misprediction recovery |
| Retiring | `100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))` | Useful work completed |
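
The Python sketch below applies these Stage 1 formulas to placeholder counter values. It is an illustrative sketch rather than a reference implementation; the branch-misprediction correction term is scaled to percent in both the Frontend bound and Bad speculation calculations so that the four categories sum to roughly 100.

```python
# Minimal sketch: compute Arm Neoverse Stage 1 top-down metrics (percentages)
# from raw PMU counts. The counts below are placeholders, not measured data.

def arm_topdown_stage1(cpu_cycles, op_retired, op_spec, stall_slot,
                       stall_slot_frontend, stall_slot_backend,
                       br_mis_pred, slots_per_cycle=8):
    """Return the four Stage 1 categories as percentages of total slots."""
    total_slots = cpu_cycles * slots_per_cycle
    correction = br_mis_pred / (4 * cpu_cycles)   # misprediction recovery adjustment
    not_stalled = 1 - stall_slot / total_slots    # fraction of slots not stalled

    backend_bound = 100 * (stall_slot_backend / total_slots)
    frontend_bound = 100 * (stall_slot_frontend / total_slots - correction)
    retiring = 100 * (op_retired / op_spec) * not_stalled
    bad_speculation = 100 * ((1 - op_retired / op_spec) * not_stalled + correction)
    return {
        "Frontend bound": frontend_bound,
        "Backend bound": backend_bound,
        "Bad speculation": bad_speculation,
        "Retiring": retiring,
    }

# Placeholder counts chosen only to demonstrate the calculation.
for name, value in arm_topdown_stage1(
    cpu_cycles=1_000_000_000,
    op_retired=3_000_000_000,
    op_spec=3_300_000_000,
    stall_slot=4_000_000_000,
    stall_slot_frontend=1_500_000_000,
    stall_slot_backend=2_500_000_000,
    br_mis_pred=20_000_000,
).items():
    print(f"{name:>15}: {value:5.1f}%")
```
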

### Stage 2 resource groups

Instead of hierarchical levels, Arm organizes detailed metrics into effectiveness groups, as shown below; a short calculation example follows the list.

- Branch Effectiveness: misprediction rates, MPKI
- ITLB/DTLB Effectiveness: translation lookaside buffer efficiency
- L1I/L1D/L2/LL Cache Effectiveness: cache hit ratios and MPKI
- Operation Mix: breakdown of instruction types (SIMD, integer, load/store)
- Cycle Accounting: frontend vs. backend stall percentages
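
As a small example of Stage 2 metrics, the sketch below computes branch MPKI, L1D cache MPKI, and the branch misprediction ratio from a handful of the events listed in the next section. The counts are placeholders for illustration.

```python
# Minimal sketch: a few Stage 2 effectiveness metrics built from Neoverse PMU events.
# The counts below are placeholders, not measured data.

inst_retired = 5_000_000_000        # INST_RETIRED
br_retired = 800_000_000            # BR_RETIRED
br_mis_pred_retired = 15_000_000    # BR_MIS_PRED_RETIRED
l1d_cache_refill = 40_000_000       # L1D_CACHE_REFILL

# MPKI: misses (or refills) per 1000 retired instructions.
branch_mpki = 1000 * br_mis_pred_retired / inst_retired
l1d_cache_mpki = 1000 * l1d_cache_refill / inst_retired

# Branch misprediction ratio: mispredicted branches per retired branch.
branch_misprediction_ratio = br_mis_pred_retired / br_retired

print(f"Branch MPKI:                {branch_mpki:.2f}")
print(f"L1D cache MPKI:             {l1d_cache_mpki:.2f}")
print(f"Branch misprediction ratio: {branch_misprediction_ratio:.2%}")
```
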

### Key performance events

Neoverse cores expose approximately 100 hardware events optimized for server workloads, including:

| Event Name | Purpose / Usage |
| :-------------------- | :--------------------------------------------------------------------------------------- |
| `CPU_CYCLES` | Core clock cycles (baseline for normalization). |
| `OP_SPEC` | Speculatively executed micro-operations (used as slot denominator). |
| `OP_RETIRED` | Retired micro-operations (used to measure useful work). |
| `INST_RETIRED` | Instructions retired (architectural measure; used for IPC, MPKI normalization). |
| `INST_SPEC` | Instructions speculatively executed (needed for operation mix and speculation analysis). |
| `STALL_SLOT` | Total stall slots (foundation for efficiency metrics). |
| `STALL_SLOT_FRONTEND` | Stall slots due to frontend resource constraints. |
| `STALL_SLOT_BACKEND` | Stall slots due to backend resource constraints. |
| `BR_RETIRED` | Branches retired (baseline for branch misprediction ratio). |
| `BR_MIS_PRED_RETIRED` | Mispredicted branches retired (branch effectiveness, speculation waste). |
| `L1I_CACHE_REFILL` | Instruction cache refills (frontend stalls due to I-cache misses). |
| `ITLB_WALK` | Instruction TLB walks (frontend stalls due to translation). |
| `L1D_CACHE_REFILL` | Data cache refills (backend stalls due to L1D misses). |
| `L2D_CACHE_REFILL` | Unified L2 cache refills (backend stalls from L2 misses). |
| `LL_CACHE_MISS_RD` | Last-level/system cache read misses (backend stalls from LLC/memory). |
| `DTLB_WALK` | Data TLB walks (backend stalls due to translation). |
| `MEM_ACCESS` | Total memory accesses (baseline for cache/TLB effectiveness ratios). |

## Arm compared to x86

### Conceptual similarities

Both architectures adhere to the same fundamental top-down performance analysis philosophy:

1. Four-category classification: Retiring, Bad Speculation, Frontend Bound, Backend Bound
2. Slot-based accounting: pipeline utilization measured in issue or rename slots
3. Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks
4. Resource attribution: map performance issues to specific CPU micro-architectural components

### Key differences

| Aspect | x86 Intel | Arm Neoverse |
| :-- | :-- | :-- |
| Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups |
| Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) |
| Formula Basis | Micro-operation (uop) centric | Operation and cycle centric |
| Event Naming | Intel-specific mnemonics | Arm-specific mnemonics |
| Drill-down Strategy | Strict hierarchical descent | Exploration by resource groups |

### Event mapping examples

| Performance Question | x86 Intel Events | Arm Neoverse Events |
| :-- | :-- | :-- |
| Frontend bound? | `IDQ_UOPS_NOT_DELIVERED.*` | `STALL_SLOT_FRONTEND` |
| Bad speculation? | `BR_MISP_RETIRED.*` | `BR_MIS_PRED_RETIRED` |
| Memory bound? | `CYCLE_ACTIVITY.STALLS_L3_MISS` | `L1D_CACHE_REFILL`, `L2D_CACHE_REFILL` |
| Cache effectiveness? | `MEM_LOAD_RETIRED.L3_MISS_PS` | Cache refill metrics / cache access metrics |

While it doesn't make sense to directly compare PMU counters between the Arm and x86 architectures, it is useful to understand the top-down methodologies for each so you can do effective performance analysis and compare your code running on each architecture.

Continue to the next step to try a code example.
