Final tweaks

madeline-underwood · madeline-underwood · commit bf9506a495ab · 2025-10-03T21:27:35.000+01:00
diff --git a/content/learning-paths/cross-platform/topdown-compare/1-top-down.md b/content/learning-paths/cross-platform/topdown-compare/1-top-down.md
@@ -1,5 +1,5 @@
 ---
-title: "Learn about Arm Neoverse and Intel x86 top-down performance analysis"
+title: "Analyze Intel x86 and Arm Neoverse top-down performance methodologies"
 weight: 3
 
 ### FIXED, DO NOT MODIFY
@@ -12,23 +12,17 @@ This is a common question from both software developers and performance engineer
 
 Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitecture, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics. 
 
-While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four buckets:
+While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas.
 
-- Retiring
-- Bad Speculation
-- Frontend Bound
-- Backend Bound
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints.
 
 This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas.
 
 ## Introduction to top-down performance analysis
 
-The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories:
+The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories.
 
-- Retiring: pipeline slots that successfully complete useful work
-- Bad Speculation: slots wasted on mispredicted branches
-- Frontend Bound: slots stalled due to instruction fetch/decode limitations
-- Backend Bound: slots stalled due to execution resource constraints
+**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability.
 
 The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time. 
 
diff --git a/content/learning-paths/cross-platform/topdown-compare/1b-arm.md b/content/learning-paths/cross-platform/topdown-compare/1b-arm.md
@@ -1,25 +1,21 @@
 ---
-title: "Deploy Arm Neoverse 2-stage top-down analysis"
+title: "Implement Arm Neoverse 2-stage top-down analysis"
 weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Configure 8-slot rename unit accounting with Arm PMU counters
+## Explore Arm's approach to performance analysis
 
-Arm developed a complementary top-down methodology specifically for Neoverse server cores. The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model.
+After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness.
 
-Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability:
+The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, whereas Intel's model counts issue slots. Instead of Intel's multi-level hierarchy, Arm uses a two-stage methodology: Stage 1 classifies pipeline slots into the four top-down categories, and Stage 2 examines resource effectiveness groups.
 
 ### Execute Stage 1: Calculate top-down performance categories
 
 Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture.
 
-### Execute Stage 2: Explore resource effectiveness groups
-
-Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU component. This stage provides industry-standard metrics like MPKI (Misses Per Kilo Instructions) and offers detailed breakdown without the strict hierarchical drilling required by Intel's methodology.
-
-### Configure Arm-specific PMU counter formulas
+#### Configure Arm-specific PMU counter formulas
 
 Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. The key difference lies in the formula calculations and slot accounting methodology:
 
@@ -30,7 +26,11 @@ Arm uses different top-down metrics based on different events but the concept re
 | Bad speculation | `100 * (1 - (OP_RETIRED/OP_SPEC)) * (1 - (STALL_SLOT/(CPU_CYCLES * 8))) + (BR_MIS_PRED / (4 * CPU_CYCLES))` | Misprediction recovery |
 | Retiring | `100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))` | Useful work completed |
 
-### Navigate resource groups without hierarchical constraints
+### Execute Stage 2: Explore resource effectiveness groups
+
+Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU component. This stage provides industry-standard metrics like MPKI (Misses Per Kilo Instructions) and offers detailed breakdown without the strict hierarchical drilling required by Intel's methodology.
+
+#### Navigate resource groups without hierarchical constraints
 
 Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages.
 
diff --git a/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md b/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md
@@ -1,27 +1,33 @@
 ---
-title: "Evaluate Intel x86 and Arm Neoverse top-down analysis: PMU counters and methodology differences"
+title: "Evaluate cross-platform PMU counter differences"
 weight: 6
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-## Recognize universal top-down performance analysis principles
+## Contrast Intel and Arm Neoverse implementation approaches
 
-Despite their different implementation approaches, both Intel x86 and Arm Neoverse architectures adhere to the same fundamental top-down performance analysis philosophy:
+After understanding each architecture's methodology individually, you can now examine how they differ in implementation while achieving equivalent analysis capabilities.
 
-**Four-category classification** forms the foundation, using Retiring, Bad Speculation, Frontend Bound, and Backend Bound across both architectures. **Slot-based accounting** measures pipeline utilization, though Intel uses issue slots while Arm employs rename slots. **Hierarchical analysis** enables broad classification followed by drill-down into dominant bottlenecks, and **resource attribution** maps performance issues to specific CPU micro-architectural components for targeted optimization.
+## Review shared implementation principles
 
-## Contrast 4-level hierarchical and 2-stage methodologies
+Both architectures implement the same fundamental approach with architecture-specific adaptations:
 
-While both architectures share the same fundamental principles, their implementation strategies differ significantly in structure and execution:
+- Slot-based accounting: Pipeline utilization measured in issue or rename slots
+- Hierarchical analysis: Broad classification followed by drill-down into dominant bottlenecks
+- Resource attribution: Performance issues mapped to specific CPU microarchitectural components
+
+## Compare 4-level hierarchical and 2-stage methodologies
+
+| Aspect | Intel x86 | Arm Neoverse |
 | :-- | :-- | :-- |
 | Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups |
 | Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) |
 | Formula Basis | Micro-operation (uop) centric | Operation and cycle centric |
 | Event Naming | Intel-specific mnemonics | Arm-specific mnemonics |
 | Drill-down Strategy | Strict hierarchical descent | Exploration by resource groups |
 
-### Event mapping examples
+## Map equivalent PMU counters across architectures
 
 | Performance Question | x86 Intel Events | Arm Neoverse Events |
 | :-- | :-- | :-- |
diff --git a/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md b/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md
@@ -1,5 +1,5 @@
 ---
-title: "Compare Arm Neoverse and Intel x86 performance using topdown-tool and Perf PMU counters"
+title: "Measure cross-platform performance with topdown-tool and Perf PMU counters"
 weight: 7
 
 ### FIXED, DO NOT MODIFY
@@ -265,9 +265,7 @@ Crypto Operations Percentage........ 0.00% operations
 
 Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
 
-Intel x86 processors use a four-level hierarchical top-down analysis methodology based on slot-based pipeline accounting. This approach relies on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into categories like retiring, bad speculation, frontend bound, and backend bound. The standard tool for collecting these metrics is Linux Perf, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns.
+Arm Neoverse platforms implement a complementary two-stage methodology. Stage 1 focuses on top-down categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics. The `topdown-tool` utility collects these metrics and accepts the desired metric group via the `-m` argument.
 
-Arm Neoverse platforms implement a two-stage top-down methodology. Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics. The recommended tool for collecting these metrics is `topdown-tool`, specifying the desired metric group with the `-m` argument.
-
-Both architectures identify the same performance bottleneck categories, allowing you to use similar optimization strategies across Intel and Arm platforms while accounting for the methodological differences in measurement and analysis depth. 
+Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach.
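The Stage 1 Retiring formula from the `1b-arm.md` table and the Stage 2 MPKI metric can be sketched in a few lines. This is a minimal illustration, assuming the raw counter totals have already been collected. The function names and sample counts are hypothetical, and the default of 8 slots per cycle follows the rename-unit width stated in the diff.

```python
def retiring_percent(op_retired, op_spec, stall_slot, cpu_cycles, slots_per_cycle=8):
    """Stage 1 Retiring percentage, following the table formula:
    100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))
    """
    total_slots = cpu_cycles * slots_per_cycle
    return 100 * (op_retired / op_spec) * (1 - stall_slot / total_slots)


def mpki(misses, instructions):
    """Misses Per Kilo Instructions: misses per 1,000 retired instructions."""
    return 1000 * misses / instructions


if __name__ == "__main__":
    # Hypothetical counter totals, chosen only to make the arithmetic easy to follow.
    print(retiring_percent(op_retired=600, op_spec=800,
                           stall_slot=2000, cpu_cycles=1000))  # 56.25
    print(mpki(misses=5000, instructions=1_000_000))           # 5.0
```

With the sample values, 75% of speculated operations retire and 25% of the 8,000 available slots are stalled, so Retiring comes out at 56.25%.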