
Commit 5cbdf01

Review top-down methodology Learning Path
1 parent 935f905 commit 5cbdf01

File tree

6 files changed: +44 −39 lines


content/learning-paths/cross-platform/topdown-compare/1-top-down.md

Lines changed: 7 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -14,17 +14,20 @@ Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitorin
1414

1515
While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas:
1616

17-
**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints.
17+
- Retiring
18+
- Bad Speculation
19+
- Frontend Bound
20+
- Backend Bound
1821

19-
This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
22+
This Learning Path provides a comparison of how x86 processors implement multi-level hierarchical top-down analysis compared to Arm Neoverse's methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
2023

2124
## Introduction to top-down performance analysis
2225

23-
The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
26+
The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of the four categories.
2427

2528
**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability.
2629

27-
The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time.
30+
The methodology allows you to drill down only into the dominant bottleneck category, avoiding the complexity of analyzing all possible performance issues at the same time.
2831

2932
The next sections compare the Intel x86 methodology with the Arm top-down methodology.
3033

content/learning-paths/cross-platform/topdown-compare/1a-intel.md

Lines changed: 12 additions & 10 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: "Implement Intel x86 4-level hierarchical top-down analysis"
2+
title: "Understand Intel x86 multi-level hierarchical top-down analysis"
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
@@ -8,9 +8,9 @@ layout: learningpathall
88

99
## Configure slot-based accounting with Intel x86 PMU counters
1010

11-
Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design but current Intel processor designs typically have four issue slots per cycle.
11+
Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design, but current Intel processor designs typically have four issue slots per cycle.
1212

13-
Intel's methodology uses a multi-level hierarchy that extends to 4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events.
13+
Intel's methodology uses a multi-level hierarchy that typically extends to 3-4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events.
1414

1515
## Level 1: Identify top-level performance categories
1616

@@ -27,18 +27,20 @@ Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores.
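The slot-based accounting described above can be sketched in Python using the standard Intel top-down Level 1 formulas (`Retiring`, `Frontend Bound`, and `Bad Speculation` are computed from the counters named in this Learning Path; the counter values below are hypothetical, chosen only to illustrate the arithmetic):

```python
# Sketch of Intel Level 1 top-down accounting (hypothetical counter values).
# SLOTS = 4 * CPU_CLK_UNHALTED.THREAD on most Intel cores.

def level1_breakdown(uops_issued, uops_retired_slots, idq_uops_not_delivered,
                     recovery_cycles, cycles, width=4):
    slots = width * cycles
    retiring = uops_retired_slots / slots
    frontend_bound = idq_uops_not_delivered / slots
    # Issued-but-not-retired uops plus recovery slots are wasted speculation.
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    # Whatever slots remain are attributed to the backend.
    backend_bound = 1.0 - (retiring + frontend_bound + bad_speculation)
    return {"Retiring": retiring, "Frontend Bound": frontend_bound,
            "Bad Speculation": bad_speculation, "Backend Bound": backend_bound}

# Example with made-up counter values:
m = level1_breakdown(uops_issued=1_100_000, uops_retired_slots=1_000_000,
                     idq_uops_not_delivered=200_000, recovery_cycles=10_000,
                     cycles=1_000_000)
for name, fraction in m.items():
    print(f"{name}: {fraction:.1%}")
```

The four fractions always sum to 1.0 because every slot in every cycle is attributed to exactly one category.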
2727

2828
Once you've identified the dominant Level 1 category, Level 2 drills into each area to identify broader causes. This level distinguishes between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend.
2929

30-
- Frontend Bound covers frontend latency in comparison with frontend bandwidth
31-
- Backend Bound covers memory bound in comparison with core bound
32-
- Bad Speculation covers branch mispredicts in comparison with machine clears
33-
- Retiring covers base in comparison with microcode sequencer
30+
- Frontend Bound covers frontend latency compared with frontend bandwidth
31+
- Backend Bound covers memory bound compared with core bound
32+
- Bad Speculation covers branch mispredicts compared with machine clears
33+
- Retiring covers base compared with microcode sequencer
3434

3535
## Level 3: Target specific microarchitecture bottlenecks
3636

37-
After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations. Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories, while Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning.
37+
After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations.
38+
39+
Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories. Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning.
3840

3941
## Level 4: Access specific PMU counter events
4042

41-
The final level provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes.
43+
Level 4 provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes.
4244

4345
## Apply essential Intel x86 PMU counters for analysis
4446

@@ -63,5 +65,5 @@ Intel processors expose hundreds of performance events, but top-down analysis re
6365
| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) |
6466

6567

66-
Using the above levels of metrics you can find out which of the four top-level categories are causing bottlenecks.
68+
Using the above levels of metrics, you can determine which of the four top-level categories are causing bottlenecks.
6769

content/learning-paths/cross-platform/topdown-compare/1b-arm.md

Lines changed: 7 additions & 5 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: "Implement Arm Neoverse 2-stage top-down analysis"
2+
title: "Understand Arm Neoverse top-down analysis"
33
weight: 5
44

55
### FIXED, DO NOT MODIFY
@@ -9,15 +9,15 @@ layout: learningpathall
99

1010
After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness.
1111

12-
The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability.
12+
The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, which differs from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability.
1313

1414
### Execute Stage 1: Calculate top-down performance categories
1515

16-
Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.
16+
Stage 1 identifies high-level bottlenecks using the same four categories as Intel, but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.
1717

1818
#### Configure Arm-specific PMU counter formulas
1919

20-
Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:
20+
Arm uses different top-down metrics based on different events, but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:
2121

2222
| Metric | Formula | Purpose |
2323
| :-- | :-- | :-- |
@@ -32,7 +32,9 @@ Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU compon
3232

3333
#### Navigate resource groups without hierarchical constraints
3434

35-
Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.
35+
Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently.
36+
37+
**Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.
3638
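The cache effectiveness groups boil down to two ratios per cache level. A minimal sketch of the arithmetic, with illustrative raw counts (not exact Neoverse event names):

```python
# Sketch: derive cache effectiveness metrics from raw event counts.
# MPKI = misses per 1,000 instructions; miss ratio = misses per access.

def cache_effectiveness(refills, accesses, instructions):
    mpki = 1000.0 * refills / instructions
    miss_ratio = refills / accesses
    return mpki, miss_ratio

# Illustrative numbers resembling the L1D output shown later in this page:
mpki, ratio = cache_effectiveness(refills=22_000, accesses=400_000_000,
                                  instructions=1_000_000_000)
print(f"L1D Cache MPKI: {mpki:.3f}")        # 0.022
print(f"L1D Cache Miss Ratio: {ratio:.3f}") # 0.000 (rounds to zero)
```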

3739
## Apply essential Arm Neoverse PMU counters for analysis
3840

content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ After understanding each architecture's methodology individually, you can now ex
1313
- Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks
1414
- Resource attribution: map performance issues to specific CPU micro-architectural components
1515

16-
## Compare 4-level hierarchical and 2-stage methodologies
16+
## Compare multi-level hierarchical and resource-group methodologies
1717

1818
| Aspect | Intel x86 | Arm Neoverse |
1919
| :-- | :-- | :-- |

content/learning-paths/cross-platform/topdown-compare/2-code-examples.md

Lines changed: 16 additions & 14 deletions
@@ -98,9 +98,9 @@ S0-D0-C1 1 8.5% 0.0% 0
9898
6.052117775 seconds time elapsed
9999
```
100100

101-
You see a very large `backend bound` component for this program.
101+
You see a very large `backend bound` component for this program.
102102

103-
You can also run with the `-M topdownl1` argument on Perf.
103+
You can also run with the `-M topdownl1` argument with Perf.
104104

105105
```console
106106
taskset -c 1 perf stat -C 1 -M topdownl1 ./test 1000000000
@@ -129,21 +129,21 @@ Done. Final result: 0.000056
129129
6.029283206 seconds time elapsed
130130
```
131131

132-
Again, showing `Backend_Bound` value very high (0.96). Notice the x86-specific PMU counters:
132+
Again, showing a `Backend_Bound` value that is very high (0.96). Notice the x86-specific PMU counters:
133133
- `uops_issued.any` and `uops_retired.retire_slots` for micro-operation accounting
134134
- `idq_uops_not_delivered.core` for frontend delivery failures
135135
- `cpu_clk_unhalted.thread` for cycle normalization
136136

137137
If you want to learn more, you can continue with the Level 2 and Level 3 hierarchical analysis.
138138

139139

140-
## Use the Arm Neoverse 2-stage top-down methodology
140+
## Use the Arm Neoverse top-down methodology
141141

142-
Arm's approach uses a 2-stage methodology with PMU counters like `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` for Stage 1 analysis, followed by resource effectiveness groups in Stage 2.
142+
Arm's approach uses PMU counters like `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` for Stage 1 analysis, followed by resource effectiveness groups in Stage 2.
143143

144144
Make sure you install the Arm topdown-tool using the [Telemetry Solution install guide](/install-guides/topdown-tool/).
145145

146-
Collect Stage 2 general metrics including Instructions Per Cycle (IPC):
146+
Collect general metrics including Instructions Per Cycle (IPC):
147147

148148
```console
149149
taskset -c 1 topdown-tool -m General ./test 1000000000
@@ -178,14 +178,14 @@ Frontend Stalled Cycles 0.04% cycles
178178
Backend Stalled Cycles. 88.15% cycles
179179
```
180180

181-
This confirms the example has high backend stalls equivalent to x86's Backend_Bound category. Notice how Arm's Stage 1 uses percentage of cycles rather than Intel's slot-based accounting.
181+
This confirms the example has high backend stalls, equivalent to x86's Backend_Bound category. Notice how Arm's Stage 1 uses percentage of cycles rather than Intel's slot-based accounting.
182182

183183
You can continue to use the `topdown-tool` for additional microarchitecture exploration.
184184

185185
For L1 data cache:
186186

187187
```console
188-
taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000
188+
taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000
189189
```
190190

191191
The output is similar to:
@@ -203,7 +203,7 @@ L1D Cache Miss Ratio......... 0.000 per cache access
203203
For L1 instruction cache effectiveness:
204204

205205
```console
206-
taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000
206+
taskset -c 1 topdown-tool -m L1I_Cache_Effectiveness ./test 1000000000
207207
```
208208

209209
The output is similar to:
@@ -213,9 +213,9 @@ Performing 1000000000 dependent floating-point divisions...
213213
Done. Final result: 0.000056
214214
Stage 2 (uarch metrics)
215215
=======================
216-
[L1 Data Cache Effectiveness]
217-
L1D Cache MPKI............... 0.022 misses per 1,000 instructions
218-
L1D Cache Miss Ratio......... 0.000 per cache access
216+
[L1 Instruction Cache Effectiveness]
217+
L1I Cache MPKI............... 0.022 misses per 1,000 instructions
218+
L1I Cache Miss Ratio......... 0.000 per cache access
219219
```
220220

221221
For last level cache:
@@ -263,9 +263,11 @@ Crypto Operations Percentage........ 0.00% operations
263263

264264
## Cross-architecture performance analysis summary
265265

266-
Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
266+
Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods.
267+
268+
Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
267269

268270
Arm Neoverse platforms implement a complementary two-stage methodology where Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics through `topdown-tool`, which accepts the desired metric group via the `-m` argument.
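The Stage 1 arithmetic can be sketched as follows, assuming an 8-slot rename unit and the counters named above (the formulas are simplified from Arm's published Neoverse methodology, and the counter values are hypothetical):

```python
# Sketch of Arm Neoverse Stage 1 top-down accounting (hypothetical values).
# Assumes an 8-slot rename unit for slot accounting; formulas simplified.

def stage1_breakdown(stall_slot_frontend, stall_slot_backend,
                     op_retired, op_spec, cycles, slots_per_cycle=8):
    total_slots = slots_per_cycle * cycles
    stall_slot = stall_slot_frontend + stall_slot_backend
    # Slots that were not stalled split between useful and speculative work,
    # in proportion to retired vs. speculatively executed operations.
    not_stalled = 1.0 - stall_slot / total_slots
    retiring = (op_retired / op_spec) * not_stalled
    bad_speculation = (1.0 - op_retired / op_spec) * not_stalled
    return {
        "Frontend Bound": stall_slot_frontend / total_slots,
        "Backend Bound": stall_slot_backend / total_slots,
        "Retiring": retiring,
        "Bad Speculation": bad_speculation,
    }

# Example with made-up counter values:
m = stage1_breakdown(stall_slot_frontend=400_000, stall_slot_backend=5_600_000,
                     op_retired=1_900_000, op_spec=2_000_000, cycles=1_000_000)
for name, fraction in m.items():
    print(f"{name}: {fraction:.1%}")
```

As with the Intel Level 1 breakdown, the four categories sum to 1.0, so a dominant Backend Bound fraction here points to the same bottleneck class as a high `Backend_Bound` on x86.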
269271

270-
Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach.
272+
Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach.
271273

content/learning-paths/cross-platform/topdown-compare/_index.md

Lines changed: 1 addition & 5 deletions
@@ -1,16 +1,12 @@
11
---
22
title: Compare Arm Neoverse and Intel x86 top-down performance analysis with PMU counters
33

4-
draft: true
5-
cascade:
6-
draft: true
7-
84
minutes_to_complete: 30
95

106
who_is_this_for: This is an advanced topic for software developers and performance engineers who want to understand the similarities and differences between Arm Neoverse and Intel x86 top-down performance analysis using PMU counters, Linux Perf, and the topdown-tool.
117

128
learning_objectives:
13-
- Compare Intel x86 4-level hierarchical top-down methodology with Arm Neoverse 2-stage approach using PMU counters
9+
- Compare the Intel x86 multi-level hierarchical methodology with the Arm Neoverse micro-architecture exploration methodology
1410
- Execute performance analysis using Linux Perf on x86 and topdown-tool on Arm systems
1511
- Analyze Backend Bound, Frontend Bound, Bad Speculation, and Retiring categories across both architectures
1612
