content/learning-paths/cross-platform/topdown-compare/1-top-down.md
+5 -11 (5 additions & 11 deletions)
@@ -1,5 +1,5 @@
 ---
-title: "Learn about Arm Neoverse and Intel x86 top-down performance analysis"
+title: "Analyze Intel x86 and Arm Neoverse top-down performance methodologies"
 weight: 3
 
 ### FIXED, DO NOT MODIFY
@@ -12,23 +12,17 @@ This is a common question from both software developers and performance engineer
 
 Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitecture, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics.
 
-While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four buckets:
+While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas:
 
-- Retiring
-- Bad Speculation
-- Frontend Bound
-- Backend Bound
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints.
 
 This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
 
 ## Introduction to top-down performance analysis
 
-The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories:
+The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
 
-- Retiring: pipeline slots that successfully complete useful work
-- Bad Speculation: slots wasted on mispredicted branches
-- Frontend Bound: slots stalled due to instruction fetch/decode limitations
-- Backend Bound: slots stalled due to execution resource constraints
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability.
 
 The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time.
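The slot attribution and drill-down described in the changed text can be illustrated with a minimal Python sketch. It assumes the per-category slot counts have already been collected; the function names `slot_fractions` and `dominant_bottleneck` and the sample numbers are hypothetical and are not part of the Learning Path.

```python
from typing import Dict

# The four top-level categories named in the text.
CATEGORIES = ("retiring", "bad_speculation", "frontend_bound", "backend_bound")


def slot_fractions(slot_counts: Dict[str, float]) -> Dict[str, float]:
    """Convert per-category slot counts into fractions of all pipeline slots."""
    total = sum(slot_counts[c] for c in CATEGORIES)
    if total <= 0:
        raise ValueError("total slot count must be positive")
    return {c: slot_counts[c] / total for c in CATEGORIES}


def dominant_bottleneck(fractions: Dict[str, float]) -> str:
    """Pick the largest non-retiring category, the one to drill into next."""
    stalls = [c for c in CATEGORIES if c != "retiring"]
    return max(stalls, key=lambda c: fractions[c])


# Hypothetical sample: 1000 slots split across the four categories.
sample = {
    "retiring": 300,
    "bad_speculation": 50,
    "frontend_bound": 150,
    "backend_bound": 500,
}

fractions = slot_fractions(sample)
print(fractions)
print("drill into:", dominant_bottleneck(fractions))
```

Because every slot lands in exactly one category, the four fractions sum to 1, which is what makes it safe to ignore the smaller categories and analyze only the dominant one.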
content/learning-paths/cross-platform/topdown-compare/1b-arm.md
+10 -10 (10 additions & 10 deletions)
@@ -1,25 +1,21 @@
 ---
-title: "Deploy Arm Neoverse 2-stage top-down analysis"
+title: "Implement Arm Neoverse 2-stage top-down analysis"
 weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Configure 8-slot rename unit accounting with Arm PMU counters
+## Explore Arm's approach to performance analysis
 
-Arm developed a complementary top-down methodology specifically for Neoverse server cores. The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model.
+After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness.
 
-Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability:
+The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability.
 Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.
 
-### Execute Stage 2: Explore resource effectiveness groups
-
-Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU component. This stage provides industry-standard metrics like MPKI (Misses Per Kilo Instructions) and offers detailed breakdown without the strict hierarchical drilling required by Intel's methodology.
-
-### Configure Arm-specific PMU counter formulas
+#### Configure Arm-specific PMU counter formulas
 
 Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:
 
@@ -30,7 +26,11 @@ Arm uses different top-down metrics based on different events but the concept re
-### Navigate resource groups without hierarchical constraints
+### Execute Stage 2: Explore resource effectiveness groups
+
+Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU component. This stage provides industry-standard metrics like MPKI (Misses Per Kilo Instructions) and offers detailed breakdown without the strict hierarchical drilling required by Intel's methodology.
+
+#### Navigate resource groups without hierarchical constraints
 
 Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.
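The Stage 2 effectiveness metrics mentioned in this hunk reduce to simple ratios. The following Python sketch shows MPKI and a miss ratio computed from raw counts; the helper names `mpki` and `miss_ratio` and the sample values are hypothetical, and which PMU events supply the counts depends on the core and is not specified here.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses Per Kilo Instructions: misses normalized per 1000 instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000.0 / instructions


def miss_ratio(misses: int, accesses: int) -> float:
    """Fraction of accesses (or branches) that missed (or mispredicted)."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return misses / accesses


# Hypothetical counts: 25,000 cache misses over 10 million instructions,
# with 500,000 cache accesses in the same interval.
print(mpki(25_000, 10_000_000))      # 2.5 misses per 1000 instructions
print(miss_ratio(25_000, 500_000))   # 0.05, a 95% hit ratio
```

MPKI normalizes by instruction count rather than by accesses, so it stays comparable between workloads that touch a given resource at very different rates.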
## Contrast Intel and Arm Neoverse implementation approaches
 
-Despite their different implementation approaches, both Intel x86 and Arm Neoverse architectures adhere to the same fundamental top-down performance analysis philosophy:
+After understanding each architecture's methodology individually, you can now examine how they differ in implementation while achieving equivalent analysis capabilities.
 
-**Four-category classification** forms the foundation, using Retiring, Bad Speculation, Frontend Bound, and Backend Bound across both architectures. **Slot-based accounting** measures pipeline utilization, though Intel uses issue slots while Arm employs rename slots. **Hierarchical analysis** enables broad classification followed by drill-down into dominant bottlenecks, and **resource attribution** maps performance issues to specific CPU micro-architectural components for targeted optimization.
+## Review shared implementation principles
 
-## Contrast 4-level hierarchical and 2-stage methodologies
+Both architectures implement the same fundamental approach with architecture-specific adaptations:
 
-While both architectures share the same fundamental principles, their implementation strategies differ significantly in structure and execution:
+- Slot-based accounting: Pipeline utilization measured in issue or rename slots
+- Hierarchical analysis: Broad classification followed by drill-down into dominant bottlenecks
+- Resource attribution: Map performance issues to specific CPU micro-architectural components
+
+## Compare 4-level hierarchical and 2-stage methodologies
+
+| Aspect | Intel x86 | Arm Neoverse |
 | :-- | :-- | :-- |
 | Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups |
 | Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) |
 | Formula Basis | Micro-operation (uop) centric | Operation and cycle centric |
Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
 
-Intel x86 processors use a four-level hierarchical top-down analysis methodology based on slot-based pipeline accounting. This approach relies on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into categories like retiring, bad speculation, frontend bound, and backend bound. The standard tool for collecting these metrics is Linux Perf, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
+Arm Neoverse platforms implement a complementary two-stage methodology where Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics through `topdown-tool`, which accepts the desired metric group via the `-m` argument.
 
-Arm Neoverse platforms implement a two-stage top-down methodology. Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics. The recommended tool for collecting these metrics is `topdown-tool`, specifying the desired metric group with the `-m` argument.
-
-Both architectures identify the same performance bottleneck categories, allowing you to use similar optimization strategies across Intel and Arm platforms while accounting for the methodological differences in measurement and analysis depth.
+Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach.
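To make the counters named in this summary concrete, here is a hedged Python sketch of Level 1 calculations for both architectures. Several parts are assumptions rather than content from the Learning Path:

- The Intel side follows the commonly published formulation for older 4-wide cores. `UOPS_ISSUED.ANY` and `INT_MISC.RECOVERY_CYCLES` are not named in the text.
- The Arm side is a simplified form. It assumes a `CPU_CYCLES` count and treats stalled slots as the sum of frontend and backend stall slots.
- The function names and sample counts are hypothetical.

Exact formulas vary by core generation and should be taken from the vendor's documentation.

```python
from typing import Dict


def intel_level1(clk_unhalted_thread: float,
                 uops_retired_retire_slots: float,
                 idq_uops_not_delivered_core: float,
                 uops_issued_any: float,           # assumption: not named in the text
                 int_misc_recovery_cycles: float,  # assumption: not named in the text
                 slots_per_cycle: int = 4) -> Dict[str, float]:
    """Level 1 fractions from issue-slot accounting (4 slots per cycle assumed)."""
    slots = slots_per_cycle * clk_unhalted_thread
    frontend = idq_uops_not_delivered_core / slots
    retiring = uops_retired_retire_slots / slots
    bad_spec = (uops_issued_any - uops_retired_retire_slots
                + slots_per_cycle * int_misc_recovery_cycles) / slots
    # Backend Bound is whatever remains once the other three are attributed.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {"retiring": retiring, "bad_speculation": bad_spec,
            "frontend_bound": frontend, "backend_bound": backend}


def arm_level1(cpu_cycles: float,                  # assumption: not named in the text
               stall_slot_frontend: float,
               stall_slot_backend: float,
               op_retired: float,
               op_spec: float,
               slots_per_cycle: int = 8) -> Dict[str, float]:
    """Level 1 fractions from rename-slot accounting (8 slots per cycle assumed)."""
    slots = slots_per_cycle * cpu_cycles
    frontend = stall_slot_frontend / slots
    backend = stall_slot_backend / slots
    # Simplification: slots that did not stall are split between useful and
    # wasted work using the retired-to-speculated operation ratio.
    not_stalled = 1.0 - (frontend + backend)
    retired_ratio = op_retired / op_spec
    return {"retiring": retired_ratio * not_stalled,
            "bad_speculation": (1.0 - retired_ratio) * not_stalled,
            "frontend_bound": frontend, "backend_bound": backend}


# Hypothetical counter values, chosen only to exercise the arithmetic.
intel = intel_level1(1000, 2000, 600, 2200, 50)
arm = arm_level1(1000, 1600, 2400, 900, 1000)
print(intel)
print(arm)
```

Both functions return the same four keys, which reflects the point made in the summary: the event names and slot widths differ, but the resulting categories can be compared directly.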