diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000000..c2da7c098d --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,222 @@ +# Project Overview + +This project is a static website created using Hugo and markdown files. The purpose of the content is to explain how-to topics to software developers targeting various Arm platforms. + +Assume the audience is made up of Arm software developers. Bias information toward Arm platforms. For Linux, assume systems are aarch64 architecture and not x86. Readers also use macOS and Windows on Arm systems, and assume Arm architecture where relevant. + +## Project structure + +The key directories are: + +### Top Level Structure + + /content - The main directory containing all Learning Paths and install guides as markdown files + /themes - HTML templates and styling elements that render the content into the final website + /tools - Python scripts for automated website integrity checking + config.toml - High-level Hugo configuration settings + +### Content Organization: + +The /content directory is organized into: + +- learning-paths/ Core learning content organized by categories: + -- embedded-and-microcontrollers/ MCU, IoT, and embedded development topics + -- servers-and-cloud-computing/ Server, cloud, and enterprise computing topics + -- mobile-graphics-and-gaming/ Mobile app development, graphics, and gaming + -- cross-platform/ Cross-platform development and general programming topics, these appear in multiple categories on the website + -- laptops-and-desktops/ Desktop application development, primarily Windows on Arm and macOS + -- automotive/ Automotive and ADAS development + -- iot/ IoT-specific Learning Paths + +- install-guides/ - Tool installation guides with supporting subdirectories organized by tool categories like docker/, gcc/, license/, browsers/, plus an _images/ directory for screenshots and diagrams + +These are special directories and not used for regular content creation: + migration/ Migration guides and resources, this maps to https://learn.arm.com/migration + lists/ Content listing and organization files, this maps to https://learn.arm.com/lists + stats/ Website statistics and analytics, this maps to https://learn.arm.com/stats + +The /content directory is the primary workspace where contributors add new Learning Paths as markdown files, organized into category-specific subdirectories that correspond to the different learning path topics available on the site at https://learn.arm.com/. + +## Content requirements + +Read the files in the directory `content/learning-paths/cross-platform/_example-learning-path` for information about how Learning Path content should be created. Some additional help is listed below. + +### Content structure + +Each Learning Path must have an _index.md file and a _next-steps.md file. The _index.md file contains the main content of the Learning Path. The _next-steps.md file contains links to related content and is included at the end of the Learning Path. 
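For example, a hypothetical multi-page Learning Path might be laid out like this (the category, directory, and file names are illustrative placeholders, not a required naming scheme):

```output
content/learning-paths/servers-and-cloud-computing/my-new-topic/
├── _index.md         # front matter plus the introduction, weight 1
├── 1-setup.md        # first content page, weight 2
├── 2-run.md          # second content page, weight 3
└── _next-steps.md    # related links, included at the end of the Learning Path
```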
+ +The _index.md file should contain the following front matter and content sections: + +Front Matter (YAML format): +- `title`: Imperative heading following the [verb] + [technology] + [outcome] format +- `weight`: Numerical ordering for display sequence, weight is 1 for _index.md and each page is ordered by weight, no markdown files should have the same weight in a directory +- `layout`: Template type (usually "learningpathall") +- `minutes_to_complete`: Realistic time estimate for completion +- `prerequisites`: List of required knowledge, tools, or prior learning paths +- `author_primary`: Main contributor's name, multiple authors can be listed separated using - on new lines +- `subjects`: Technology categories for filtering and search, this is a closed list and must match one of the subjects listed on https://learn.arm.com/learning-paths/cross-platform/_example-learning-path/write-2-metadata/ +- `armips`: Relevant Arm IP, stick to Neoverse, Cortex-A, Cortex-M, etc. Don't list specific CPU models or Arm architecture versions +- `tools_software_languages`: Open category listing Programming languages, frameworks, and development tools used +- `skilllevels`: Skill levels allowed are only Introductory and Advanced +- `operatingsystems`: Operating systems used, must match the closed list on https://learn.arm.com/learning-paths/cross-platform/_example-learning-path/write-2-metadata/ + + +All Learning Paths should generally include: +Title: [Imperative verb] + [technology/tool] + [outcome] +Introduction paragraph: Context + user goal + value proposition +Prerequisites section with explicit requirements and links +Learning objectives: 3-4 bulleted, measurable outcomes with action verbs +Step-by-step sections with logical progression +Clear next steps/conclusion + +For title formatting: +- MUST use imperative voice ("Deploy", "Configure", "Build", "Create") +- MUST include SEO keywords (technology names, tools) +- Examples: "Deploy applications on Arm servers", "Configure Arm processors for optimal performance" + +Learning Path should always be capitalized. + +### Writing style + +Voice and Tone: +- Second person ("you", "your") - NEVER first person ("I", "we") +- Active voice - AVOID passive constructions +- Present tense for descriptions +- Imperative mood for commands +- Confident and developer-friendly tone +- Encouraging language for complex tasks + +Sentence Structure: +- Average 15-20 words per sentence +- Split complex sentences for scalability +- Plain English - avoid jargon overload +- US spellings required (organize/optimize/realize, not organise/optimise/realise) +- "Arm" capitalization required (Arm processors/Neoverse, never ARM or arm; exceptions: "arm64" and "aarch64" are permitted in code, commands, and outputs) +- Define acronyms on first use +- Parallel structure in all lists + +### Arm naming and architecture terms + +- Use Arm for the brand in prose (for example, "Arm processors", "Arm servers"). +- Use arm64 or aarch64 for the CPU architecture; these are acceptable and interchangeable labels. Prefer whichever term a tool, package, or OS uses natively. +- Do not use ARM in any context. +- ARM64 is used by Windows on Arm and Microsoft documentation, so it is acceptable to use ARM64 when specifically referring to Windows on Arm. +- In code blocks, CLI flags, package names, file paths, and outputs, keep the exact casing used by the tool (for example, --arch arm64, uname -m → aarch64). 
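As a quick reference, here is a sketch of how the front matter fields described above fit together in an _index.md file. All values are illustrative placeholders; check the closed lists linked above for valid `subjects` and `operatingsystems` entries before publishing:

```yaml
---
title: Deploy a web server on Arm servers    # imperative: verb + technology + outcome
weight: 1                                    # 1 for _index.md, later pages use higher weights
layout: learningpathall
minutes_to_complete: 30
prerequisites:
  - An Arm-based cloud instance running Linux
author_primary: Firstname Lastname
subjects: Web
armips:
  - Neoverse
tools_software_languages:
  - GCC
skilllevels: Introductory
operatingsystems:
  - Linux
---
```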
+ +### Heading guidelines + +HEADING TYPES: +- Conceptual headings: When explaining technology/motivation ("What is containerization?") +- Imperative headings: When user takes action ("Configure the database") +- Interrogative headings: For FAQ content ("How does Arm differ from x86?") +- ALL headings: Use sentence case (first word capitalized, rest lowercase except proper nouns) + +HIERARCHY: +H1: Page title (imperative + technology + outcome) +H2: Major workflow steps or conceptual sections +H3: Sub-procedures or detailed explanations +H4: Specific technical details or troubleshooting + +### Code samples and formatting + +CONTEXT-BEFORE-CODE RULE: +- ALWAYS provide explanation before code blocks +- Format: [What it does] → [Code] → [Expected outcome] → [Key parameters] + +CODE FORMATTING: + +Use markdown tags for programming languages like bash, python, yaml, json, etc. + +Use console or bash for general commands. Try to use the same one throughout a Learning Path. + +Correct format: + +Use the following command to install required packages: + +```bash +sudo apt-get update && sudo apt-get install -y python3 nodejs +``` + +Use the output tag to show expected command output. + +```output +Reading package lists... Done +Building dependency tree... Done +``` + +FORMATTING STANDARDS: +- **Bold text**: UI elements (buttons, menu items, field names) +- **Italic text**: Emphasis and new terms +- **Code formatting**: Use for file names, commands, code elements + +Use shortcodes for common pitfalls, warnings, important notes + +{{% notice Note %}} +An example note to pay attention to. +{{% /notice %}} + +{{% notice Warning %}} +A warning about a common pitfall. +{{% /notice %}} + +## Avoid looking like AI-generated content + +### Bullet List Management +WARNING SIGNS OF OVER-BULLETING: +- More than 3 consecutive sections using bullet lists +- Bullet points that could be combined into narrative paragraphs +- Lists where items don't have parallel structure +- Bullet points that are actually full sentences better suited for paragraphs + +CONVERSION STRATEGY: + +Use flowing narrative instead of excessive bullets. + +For example, use this format instead of the list below it. + +Arm processors deliver improved performance while enhancing security through hardware-level protections. This architecture provides enhanced scalability for cloud workloads and reduces operational costs through energy efficiency. + +Key benefits include: +• Improved performance +• Better security +• Enhanced scalability +• Reduced costs + +### Natural Writing Patterns + +HUMAN-LIKE TECHNIQUES: +- Vary sentence length: Mix short, medium, and complex sentences +- Use transitional phrases: "Additionally", "However", "As a result", "Furthermore" +- Include contextual explanations: Why something matters, not just what to do +- Add relevant examples: Real-world scenarios that illustrate concepts +- Connect ideas logically: Show relationships between concepts and steps + +CONVERSATIONAL ELEMENTS: + +Instead of: "Execute the following command:" +Use: "Now that you've configured the environment, run the following command to start the service:" + +Instead of: "This provides benefits:" +Use: "You'll notice several advantages with this approach, particularly when working with..." + +## Hyperlink guidelines + +Some links are useful in content, but too many links can be distracting and readers will leave the platform following them. 
Try to put only necessary links in the content and put other links in the "Next Steps" section at the end of the content. Flag any page with too many links for review. + +### Internal links + +Use a relative path format for internal links that are on learn.arm.com. +For example, use: descriptive link text pointing to a relative path like learning-paths/category/path-name/ + +Examples: +- learning-paths/servers-and-cloud-computing/csp/ (Arm-based instance) +- learning-paths/cross-platform/docker/ (Docker learning path) + +### External links + +Use the full URL for external links that are not on learn.arm.com, these open in a new tab. + +This instruction set enables high-quality Arm Learning Paths content while maintaining consistency and technical accuracy. + + + diff --git a/.wordlist.txt b/.wordlist.txt index afdc6f20b0..dbc30450ad 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -4960,4 +4960,20 @@ SqueezeNet TIdentify goroutines mysqlslap -squeezenet \ No newline at end of file +squeezenet +Aphex +Autoscale +Halide’s +KRaft +Kedify's +Kedify’s +MirrorMaker +NIC’s +Neoverse's +OpenBMC’s +Rebalance +StatefulSets +codemia +multidisks +testsh +uops \ No newline at end of file diff --git a/README.md b/README.md index f2934b727d..d5b4ffb7c4 100644 --- a/README.md +++ b/README.md @@ -20,6 +20,10 @@ All contributions are welcome as long as they relate to software development for Note that all site content, including new contributions, is licensed under a [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/). +## AI Development Tools + +When using AI coding assistants (GitHub Copilot, Amazon Q, Gemini, Cursor, etc.), refer to `.github/copilot-instructions.md` for project-specific guidelines including content requirements, writing style standards, and Arm terminology conventions for Learning Paths. +
# Other Arm Learning Resources diff --git a/content/learning-paths/automotive/_index.md b/content/learning-paths/automotive/_index.md index 43a8e96e60..996e85ca16 100644 --- a/content/learning-paths/automotive/_index.md +++ b/content/learning-paths/automotive/_index.md @@ -28,9 +28,11 @@ tools_software_languages_filter: - Docker: 2 - FVP: 1 - GCC: 3 +- Perf: 1 - Python: 2 - Raspberry Pi: 1 - ROS 2: 3 - Rust: 1 +- topdown-tool: 1 - Zenoh: 1 --- diff --git a/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md b/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md index f5a43c2979..3a13d6d5c2 100644 --- a/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md +++ b/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md @@ -12,7 +12,7 @@ Learning Paths are created using Markdown. Refer to this section when you have questions on how to format your content correctly. -You can also refer to other Markdown resources, and if you are unsure, look [this page in GitHub](https://github.com/jasonrandrews/arm-learning-paths/blob/main/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md?plain=1) to see how to do formatting. +You can also refer to other Markdown resources, and if you are unsure, look [this page in GitHub](https://github.com/ArmDeveloperEcosystem/arm-learning-paths/blob/main/content/learning-paths/cross-platform/_example-learning-path/appendix-1-formatting.md?plain=1) to see how to do formatting. ## Learning Path Formatting diff --git a/content/learning-paths/cross-platform/topdown-compare/1-top-down.md b/content/learning-paths/cross-platform/topdown-compare/1-top-down.md index de65d5cd6f..d87ebb3e3e 100644 --- a/content/learning-paths/cross-platform/topdown-compare/1-top-down.md +++ b/content/learning-paths/cross-platform/topdown-compare/1-top-down.md @@ -1,197 +1,33 @@ --- -title: Top-down performance analysis +title: "Analyze Intel x86 and Arm Neoverse top-down performance methodologies" weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## What are the differences between Arm and x86 PMU counters? +## What are the differences between Arm and Intel x86 PMU counters? -This is a common question from software developers and performance engineers. +This is a common question from both software developers and performance engineers working across architectures. -Both Arm and x86 CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitecture, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics. +Both Intel x86 and Arm Neoverse CPUs provide sophisticated Performance Monitoring Units (PMUs) with hundreds of hardware counters. Instead of trying to list all available counters and compare microarchitecture, it makes more sense to focus on the performance methodologies they enable and the calculations used for performance metrics. -While the specific counter names and formulas differ between architectures, both have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four buckets: Retiring, Bad Speculation, Frontend Bound, and Backend Bound. 
+While the specific counter names and formulas differ between architectures, both Intel x86 and Arm Neoverse have converged on top-down performance analysis methodologies that categorize performance bottlenecks into four key areas: -This Learning Path provides a comparison of how Arm and x86 processors implement top-down -analysis, highlighting the similarities in approach while explaining the architectural differences in counter events and formulas. +**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches. Additionally, **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, and **Backend Bound** covers slots stalled by execution resource constraints. + +This Learning Path provides a comparison of how x86 processors implement four-level hierarchical top-down analysis compared to Arm Neoverse's two-stage methodology, highlighting the similarities in approach while explaining the architectural differences in PMU counter events and formulas. ## Introduction to top-down performance analysis -Top-down methodology makes performance analysis easier by shifting focus from individual performance -counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories. +The top-down methodology makes performance analysis easier by shifting focus from individual PMU counters to pipeline slot utilization. Instead of trying to interpret dozens of seemingly unrelated metrics, you can systematically identify bottlenecks by attributing each CPU pipeline slot to one of four categories. -- Retiring: pipeline slots that successfully complete useful work -- Bad Speculation: slots wasted on mispredicted branches -- Frontend Bound: slots stalled due to instruction fetch/decode limitations -- Backend Bound: slots stalled due to execution resource constraints +**Retiring** represents pipeline slots that successfully complete useful work, while **Bad Speculation** accounts for slots wasted on mispredicted branches and pipeline flushes. **Frontend Bound** identifies slots stalled due to instruction fetch and decode limitations, whereas **Backend Bound** covers slots stalled by execution resource constraints such as cache misses or arithmetic unit availability. The methodology uses a hierarchical approach that allows you to drill down only into the dominant bottleneck category, and avoid the complexity of analyzing all possible performance issues at the same time. -The next sections compare the Intel x86 methodology with the Arm top-down methodology. AMD also has an equivalent top-down methodology which is similar to Intel, but uses different counters and calculations. - -## Intel x86 top-down methodology - -Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process operations. More slots means more work can be done. The number of slots depends on the design but current processor designs have 4, 6, or 8 slots. - -### Hierarchical Structure - -Intel uses a multi-level hierarchy that typically extends to 4 levels of detail. - -**Level 1 (Top-Level):** - -At Level 1, all pipeline slots are attributed to one of four categories, providing a high-level view of whether the CPU is doing useful work or stalling. 
- -- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS` -- Bad Speculation = `(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + N * RECOVERY_CYCLES) / SLOTS` -- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS` -- Backend Bound = `1 - (Frontend + Bad Spec + Retiring)` - -Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores. - -**Level 2 breakdown:** - -Level 2 drills into each of these to identify broader causes, such as distinguishing between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend. - -- Frontend Bound covers frontend latency vs. frontend bandwidth -- Backend Bound covers memory bound vs. core bound -- Bad Speculation covers branch mispredicts vs. machine clears -- Retiring covers base vs. microcode sequencer - -**Level 3 breakdown:** - -Level 3 provides fine-grained attribution, pinpointing specific bottlenecks like DRAM latency, cache misses, or port contention, which makes it possible to identify the exact root cause and apply targeted optimizations. - -- Memory Bound includes L1 Bound, L2 Bound, L3 Bound, DRAM Bound, Store Bound -- Core Bound includes Divider, Ports Utilization -- And many more specific categories - -**Level 4 breakdown:** - -Level 4 provides the specific microarchitecture events that cause the inefficiencies. - -### Key Performance Events - -Intel processors expose hundreds of performance events, but top-down analysis relies on a core set: - -| Event Name | Purpose | -| :---------------------------------------------- | :----------------------------------------------------------------------------------- | -| `UOPS_RETIRED.RETIRE_SLOTS` | Count retired micro-operations (Retiring) | -| `UOPS_ISSUED.ANY` | Count issued micro-operations (helps quantify Bad Speculation) | -| `IDQ_UOPS_NOT_DELIVERED.CORE` | Frontend delivery failures (Frontend Bound) | -| `CPU_CLK_UNHALTED.THREAD` | Core clock cycles (baseline for normalization) | -| `BR_MISP_RETIRED.ALL_BRANCHES` | Branch mispredictions (Bad Speculation detail) | -| `MACHINE_CLEARS.COUNT` | Pipeline clears due to memory ordering or faults (Bad Speculation detail) | -| `CYCLE_ACTIVITY.STALLS_TOTAL` | Total stall cycles (baseline for backend breakdown) | -| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | Aggregate stalls from memory hierarchy misses (Backend → Memory Bound) | -| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | Stalls due to L1 data cache misses | -| `CYCLE_ACTIVITY.STALLS_L2_MISS` | Stalls waiting on L2 cache misses | -| `CYCLE_ACTIVITY.STALLS_L3_MISS` | Stalls waiting on last-level cache misses | -| `MEM_LOAD_RETIRED.L1_HIT` / `L2_HIT` / `L3_HIT` | Track where loads are satisfied in the cache hierarchy | -| `MEM_LOAD_RETIRED.L3_MISS` | Loads missing LLC and going to memory | -| `MEM_LOAD_RETIRED.DRAM_HIT` | Loads serviced by DRAM (DRAM Bound detail) | -| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) | - - -Using the above levels of metrics you can find out which of the 4 top-level categories are causing bottlenecks. - -### Arm top-down methodology - -Arm developed a similar top-down methodology for Neoverse server cores. The Arm architecture uses an 8-slot rename unit for pipeline bandwidth accounting. 
- -### Two-Stage Approach - -Unlike Intel's hierarchical model, Arm employs a two-stage methodology: - -**Stage 1: Topdown analysis** - -- Identifies high-level bottlenecks using the same four categories -- Uses Arm-specific PMU events and formulas -- Slot-based accounting similar to Intel but with Arm event names - -**Stage 2: Micro-architecture exploration** - -- Resource-specific effectiveness metrics grouped by CPU component -- Industry-standard metrics like MPKI (Misses Per Kilo Instructions) -- Detailed breakdown without strict hierarchical drilling - -### Stage 1 formulas - -Arm uses different top-down metrics based on different events but the concept is similar. - -| Metric | Formula | Purpose | -| :-- | :-- | :-- | -| Backend bound | `100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * 8))` | Backend resource constraints | -| Frontend bound | `100 * ((STALL_SLOT_FRONTEND / (CPU_CYCLES * 8)) - (BR_MIS_PRED / (4 * CPU_CYCLES)))` | Frontend delivery issues | -| Bad speculation | `100 * (1 - (OP_RETIRED/OP_SPEC)) * (1 - (STALL_SLOT/(CPU_CYCLES * 8))) + (BR_MIS_PRED / (4 * CPU_CYCLES))` | Misprediction recovery | -| Retiring | `100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))` | Useful work completed | - -### Stage 2 resource groups - -Instead of hierarchical levels, Arm organizes detailed metrics into effectiveness groups as shown below: - -- Branch Effectiveness: Misprediction rates, MPKI -- ITLB/DTLB Effectiveness: Translation lookaside buffer efficiency -- L1I/L1D/L2/LL Cache Effectiveness: Cache hit ratios and MPKI -- Operation Mix: Breakdown of instruction types (SIMD, integer, load/store) -- Cycle Accounting: Frontend vs. backend stall percentages - -### Key performance events - -Neoverse cores expose approximately 100 hardware events optimized for server workloads, including: - -| Event Name | Purpose / Usage | -| :-------------------- | :--------------------------------------------------------------------------------------- | -| `CPU_CYCLES` | Core clock cycles (baseline for normalization). | -| `OP_SPEC` | Speculatively executed micro-operations (used as slot denominator). | -| `OP_RETIRED` | Retired micro-operations (used to measure useful work). | -| `INST_RETIRED` | Instructions retired (architectural measure; used for IPC, MPKI normalization). | -| `INST_SPEC` | Instructions speculatively executed (needed for operation mix and speculation analysis). | -| `STALL_SLOT` | Total stall slots (foundation for efficiency metrics). | -| `STALL_SLOT_FRONTEND` | Stall slots due to frontend resource constraints. | -| `STALL_SLOT_BACKEND` | Stall slots due to backend resource constraints. | -| `BR_RETIRED` | Branches retired (baseline for branch misprediction ratio). | -| `BR_MIS_PRED_RETIRED` | Mispredicted branches retired (branch effectiveness, speculation waste). | -| `L1I_CACHE_REFILL` | Instruction cache refills (frontend stalls due to I-cache misses). | -| `ITLB_WALK` | Instruction TLB walks (frontend stalls due to translation). | -| `L1D_CACHE_REFILL` | Data cache refills (backend stalls due to L1D misses). | -| `L2D_CACHE_REFILL` | Unified L2 cache refills (backend stalls from L2 misses). | -| `LL_CACHE_MISS_RD` | Last-level/system cache read misses (backend stalls from LLC/memory). | -| `DTLB_WALK` | Data TLB walks (backend stalls due to translation). | -| `MEM_ACCESS` | Total memory accesses (baseline for cache/TLB effectiveness ratios). 
| - - -## Arm compared to x86 - -### Conceptual similarities - -Both architectures adhere to the same fundamental top-down performance analysis philosophy: - -1. Four-category classification: Retiring, Bad Speculation, Frontend Bound, Backend Bound -2. Slot-based accounting: Pipeline utilization measured in issue or rename slots -3. Hierarchical analysis: Broad classification followed by drill-down into dominant bottlenecks -4. Resource attribution: Map performance issues to specific CPU micro-architectural components - -### Key Differences - -| Aspect | x86 Intel | Arm Neoverse | -| :-- | :-- | :-- | -| Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups | -| Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) | -| Formula Basis | Micro-operation (uop) centric | Operation and cycle centric | -| Event Naming | Intel-specific mnemonics | Arm-specific mnemonics | -| Drill-down Strategy | Strict hierarchical descent | Exploration by resource groups | - -### Event Mapping Examples - -| Performance Question | x86 Intel Events | Arm Neoverse Events | -| :-- | :-- | :-- | -| Frontend bound? | `IDQ_UOPS_NOT_DELIVERED.*` | `STALL_SLOT_FRONTEND` | -| Bad speculation? | `BR_MISP_RETIRED.*` | `BR_MIS_PRED_RETIRED` | -| Memory bound? | `CYCLE_ACTIVITY.STALLS_L3_MISS` | `L1D_CACHE_REFILL`, `L2D_CACHE_REFILL` | -| Cache effectiveness? | `MEM_LOAD_RETIRED.L3_MISS_PS` | Cache refill metrics / Cache access metrics | - -While it doesn't make sense to directly compare PMU counters for the Arm and x86 architectures, it is useful to understand the top-down methodologies for each so you can do effective performance analysis and compare you code running on each architecture. +The next sections compare the Intel x86 methodology with the Arm top-down methodology. -Continue to the next step to try a code example. \ No newline at end of file +{{% notice Note %}} +AMD also has an equivalent top-down methodology which is similar to Intel, but uses different counters and calculations. +{{% /notice %}} diff --git a/content/learning-paths/cross-platform/topdown-compare/1a-intel.md b/content/learning-paths/cross-platform/topdown-compare/1a-intel.md new file mode 100644 index 0000000000..4ff98e1b1b --- /dev/null +++ b/content/learning-paths/cross-platform/topdown-compare/1a-intel.md @@ -0,0 +1,67 @@ +--- +title: "Implement Intel x86 4-level hierarchical top-down analysis" +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Configure slot-based accounting with Intel x86 PMU counters + +Intel uses a slot-based accounting model where each CPU cycle provides multiple issue slots. A slot is a hardware resource needed to process micro-operations (uops). More slots means more work can be done per cycle. The number of slots depends on the microarchitecture design but current Intel processor designs typically have four issue slots per cycle. + +Intel's methodology uses a multi-level hierarchy that extends to 4 levels of detail. Each level provides progressively more granular analysis, allowing you to drill down from high-level categories to specific microarchitecture events. + +## Level 1: Identify top-level performance categories + +At Level 1, all pipeline slots are attributed to one of four categories, providing a high-level view of whether the CPU is doing useful work or stalling. 
+ +- Retiring = `UOPS_RETIRED.RETIRE_SLOTS / SLOTS` +- Bad Speculation = `(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + N * RECOVERY_CYCLES) / SLOTS` +- Frontend Bound = `IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS` +- Backend Bound = `1 - (Frontend + Bad Spec + Retiring)` + +Where `SLOTS = 4 * CPU_CLK_UNHALTED.THREAD` on most Intel cores. + +## Level 2: Analyze broader bottleneck causes + +Once you've identified the dominant Level 1 category, Level 2 drills into each area to identify broader causes. This level distinguishes between frontend latency and bandwidth limits, or between memory and core execution stalls in the backend. + +- Frontend Bound covers frontend latency in comparison with frontend bandwidth +- Backend Bound covers memory bound in comparison with core bound +- Bad Speculation covers branch mispredicts in comparison with machine clears +- Retiring covers base in comparison with microcode sequencer + +## Level 3: Target specific microarchitecture bottlenecks + +After identifying broader cause categories in Level 2, Level 3 provides fine-grained attribution that pinpoints specific bottlenecks like DRAM latency, cache misses, or port contention. This precision makes it possible to identify the exact root cause and apply targeted optimizations. Memory Bound expands into detailed cache hierarchy analysis including L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound categories, while Core Bound breaks down into execution unit constraints such as Divider and Ports Utilization, along with many other specific microarchitecture-level categories that enable precise performance tuning. + +## Level 4: Access specific PMU counter events + +The final level provides direct access to the specific microarchitecture events that cause the inefficiencies. At this level, you work directly with raw PMU counter values to understand the underlying hardware behavior causing performance bottlenecks. This enables precise tuning by identifying exactly which execution units, cache levels, or pipeline stages are limiting performance, allowing you to apply targeted code optimizations or hardware configuration changes. 
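To see how the Level 1 arithmetic works, the sketch below applies the formulas above to a set of made-up counter values. The numbers are hypothetical, `N` is taken to be the issue width (4), and `perf stat --topdown` performs this accounting for you, so treat the sketch only as an illustration of the math:

```python
# Illustrative Intel Level 1 top-down calculation with hypothetical counter values
uops_retired_retire_slots = 3.2e9
uops_issued_any = 3.6e9
idq_uops_not_delivered_core = 1.1e9
cpu_clk_unhalted_thread = 2.0e9
recovery_cycles = 0.05e9                     # assumed value for the RECOVERY_CYCLES event

slots = 4 * cpu_clk_unhalted_thread          # 4 issue slots per cycle on most Intel cores

retiring = uops_retired_retire_slots / slots
bad_speculation = (uops_issued_any - uops_retired_retire_slots + 4 * recovery_cycles) / slots
frontend_bound = idq_uops_not_delivered_core / slots
backend_bound = 1 - (retiring + bad_speculation + frontend_bound)

for name, value in [("Retiring", retiring), ("Bad Speculation", bad_speculation),
                    ("Frontend Bound", frontend_bound), ("Backend Bound", backend_bound)]:
    print(f"{name:16} {value:.1%}")
```

With these example values the four categories sum to 100%, which is a useful sanity check whenever you compute the metrics yourself from raw counters.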
+ +## Apply essential Intel x86 PMU counters for analysis + +Intel processors expose hundreds of performance events, but top-down analysis relies on a core set of counters that map directly to the four-level hierarchy: + +| Event Name | Purpose | +| :---------------------------------------------- | :----------------------------------------------------------------------------------- | +| `UOPS_RETIRED.RETIRE_SLOTS` | Count retired micro-operations (Retiring) | +| `UOPS_ISSUED.ANY` | Count issued micro-operations (helps quantify Bad Speculation) | +| `IDQ_UOPS_NOT_DELIVERED.CORE` | Frontend delivery failures (Frontend Bound) | +| `CPU_CLK_UNHALTED.THREAD` | Core clock cycles (baseline for normalization) | +| `BR_MISP_RETIRED.ALL_BRANCHES` | Branch mispredictions (Bad Speculation detail) | +| `MACHINE_CLEARS.COUNT` | Pipeline clears due to memory ordering or faults (Bad Speculation detail) | +| `CYCLE_ACTIVITY.STALLS_TOTAL` | Total stall cycles (baseline for backend breakdown) | +| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | Aggregate stalls from memory hierarchy misses (Backend → Memory Bound) | +| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | Stalls due to L1 data cache misses | +| `CYCLE_ACTIVITY.STALLS_L2_MISS` | Stalls waiting on L2 cache misses | +| `CYCLE_ACTIVITY.STALLS_L3_MISS` | Stalls waiting on last-level cache misses | +| `MEM_LOAD_RETIRED.L1_HIT` / `L2_HIT` / `L3_HIT` | Track where loads are satisfied in the cache hierarchy | +| `MEM_LOAD_RETIRED.L3_MISS` | Loads missing LLC and going to memory | +| `MEM_LOAD_RETIRED.DRAM_HIT` | Loads serviced by DRAM (DRAM Bound detail) | +| `OFFCORE_RESPONSE.*` | Detailed classification of off-core responses (L3 vs. DRAM, local vs. remote socket) | + + +Using the above levels of metrics you can find out which of the four top-level categories are causing bottlenecks. + diff --git a/content/learning-paths/cross-platform/topdown-compare/1b-arm.md b/content/learning-paths/cross-platform/topdown-compare/1b-arm.md new file mode 100644 index 0000000000..7ce61660b9 --- /dev/null +++ b/content/learning-paths/cross-platform/topdown-compare/1b-arm.md @@ -0,0 +1,61 @@ +--- +title: "Implement Arm Neoverse 2-stage top-down analysis" +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +## Explore Arm's approach to performance analysis + +After understanding Intel's comprehensive 4-level hierarchy, you can explore how Arm approached the same performance analysis challenge with a different philosophy. Arm developed a complementary top-down methodology specifically for Neoverse server cores that prioritizes practical usability while maintaining analysis effectiveness. + +The Arm Neoverse architecture uses an 8-slot rename unit for pipeline bandwidth accounting, differing from Intel's issue-slot model. Unlike Intel's hierarchical model, Arm employs a streamlined two-stage methodology that balances analysis depth with practical usability. + +### Execute Stage 1: Calculate top-down performance categories + +Stage 1 identifies high-level bottlenecks using the same four categories as Intel but with Arm-specific PMU events and formulas. This stage uses slot-based accounting similar to Intel's approach while employing Arm event names and calculations tailored to the Neoverse architecture. + +#### Configure Arm-specific PMU counter formulas + +Arm uses different top-down metrics based on different events but the concept remains similar to Intel's approach. 
The key difference lies in the formula calculations and slot accounting methodology: + +| Metric | Formula | Purpose | +| :-- | :-- | :-- | +| Backend bound | `100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * 8))` | Backend resource constraints | +| Frontend bound | `100 * ((STALL_SLOT_FRONTEND / (CPU_CYCLES * 8)) - (BR_MIS_PRED / (4 * CPU_CYCLES)))` | Frontend delivery issues | +| Bad speculation | `100 * (1 - (OP_RETIRED/OP_SPEC)) * (1 - (STALL_SLOT/(CPU_CYCLES * 8))) + (BR_MIS_PRED / (4 * CPU_CYCLES))` | Misprediction recovery | +| Retiring | `100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES * 8)))` | Useful work completed | + +### Execute Stage 2: Explore resource effectiveness groups + +Stage 2 focuses on resource-specific effectiveness metrics grouped by CPU component. This stage provides industry-standard metrics like MPKI (Misses Per Kilo Instructions) and offers detailed breakdown without the strict hierarchical drilling required by Intel's methodology. + +#### Navigate resource groups without hierarchical constraints + +Instead of Intel's hierarchical levels, Arm organizes detailed metrics into effectiveness groups that can be explored independently. **Branch Effectiveness** provides misprediction rates and MPKI, while **ITLB/DTLB Effectiveness** measures translation lookaside buffer efficiency. **Cache Effectiveness** groups (L1I/L1D/L2/LL) deliver cache hit ratios and MPKI across the memory hierarchy. Additionally, **Operation Mix** breaks down instruction types (SIMD, integer, load/store), and **Cycle Accounting** tracks frontend versus backend stall percentages. + +## Apply essential Arm Neoverse PMU counters for analysis + +Neoverse cores expose approximately 100 hardware events optimized for server workloads. The core set for top-down analysis includes: + +| Event Name | Purpose / Usage | +| :-------------------- | :--------------------------------------------------------------------------------------- | +| `CPU_CYCLES` | Core clock cycles (baseline for normalization). | +| `OP_SPEC` | Speculatively executed micro-operations (used as slot denominator). | +| `OP_RETIRED` | Retired micro-operations (used to measure useful work). | +| `INST_RETIRED` | Instructions retired (architectural measure; used for IPC, MPKI normalization). | +| `INST_SPEC` | Instructions speculatively executed (needed for operation mix and speculation analysis). | +| `STALL_SLOT` | Total stall slots (foundation for efficiency metrics). | +| `STALL_SLOT_FRONTEND` | Stall slots due to frontend resource constraints. | +| `STALL_SLOT_BACKEND` | Stall slots due to backend resource constraints. | +| `BR_RETIRED` | Branches retired (baseline for branch misprediction ratio). | +| `BR_MIS_PRED_RETIRED` | Mispredicted branches retired (branch effectiveness, speculation waste). | +| `L1I_CACHE_REFILL` | Instruction cache refills (frontend stalls due to I-cache misses). | +| `ITLB_WALK` | Instruction TLB walks (frontend stalls due to translation). | +| `L1D_CACHE_REFILL` | Data cache refills (backend stalls due to L1D misses). | +| `L2D_CACHE_REFILL` | Unified L2 cache refills (backend stalls from L2 misses). | +| `LL_CACHE_MISS_RD` | Last-level/system cache read misses (backend stalls from LLC/memory). | +| `DTLB_WALK` | Data TLB walks (backend stalls due to translation). | +| `MEM_ACCESS` | Total memory accesses (baseline for cache/TLB effectiveness ratios). 
| + + diff --git a/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md b/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md new file mode 100644 index 0000000000..b87f0b03b0 --- /dev/null +++ b/content/learning-paths/cross-platform/topdown-compare/1c-compare-arch.md @@ -0,0 +1,41 @@ +--- +title: "Evaluate cross-platform PMU counter differences" +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +## Contrast Intel and Arm Neoverse implementation approaches + +After understanding each architecture's methodology individually, you can now examine how they differ in implementation while achieving equivalent analysis capabilities. Both architectures implement the same fundamental approach with architecture-specific adaptations: + +- Slot-based accounting: pipeline utilization measured in issue or rename slots +- Hierarchical analysis: broad classification followed by drill-down into dominant bottlenecks +- Resource attribution: map performance issues to specific CPU micro-architectural components + +## Compare 4-level hierarchical and 2-stage methodologies + +| Aspect | Intel x86 | Arm Neoverse | +| :-- | :-- | :-- | +| Hierarchy Model | Multi-level tree (Level 1 → Level 2 → Level 3+) | Two-stage: Topdown Level 1 + Resource Groups | +| Slot Width | 4 issue slots per cycle (typical) | 8 rename slots per cycle (Neoverse V1) | +| Formula Basis | Micro-operation (uop) centric | Operation and cycle centric | +| Event Naming | Intel-specific mnemonics | Arm-specific mnemonics | +| Drill-down Strategy | Strict hierarchical descent | Exploration by resource groups | + +## Map equivalent PMU counters across architectures + +| Performance Question | x86 Intel Events | Arm Neoverse Events | +| :-- | :-- | :-- | +| Frontend bound? | `IDQ_UOPS_NOT_DELIVERED.*` | `STALL_SLOT_FRONTEND` | +| Bad speculation? | `BR_MISP_RETIRED.*` | `BR_MIS_PRED_RETIRED` | +| Memory bound? | `CYCLE_ACTIVITY.STALLS_L3_MISS` | `L1D_CACHE_REFILL`, `L2D_CACHE_REFILL` | +| Cache effectiveness? | `MEM_LOAD_RETIRED.L3_MISS_PS` | Cache refill metrics / Cache access metrics | + +While PMU counter names and calculation formulas differ significantly between Intel x86 and Arm Neoverse architectures, both provide equivalent top-down analysis capabilities. Understanding these methodological differences enables effective cross-platform performance optimization: + +- **Intel x86**: Use `perf stat --topdown` for Level 1 analysis, then drill down through hierarchical levels +- **Arm Neoverse**: Use `topdown-tool -m Cycle_Accounting` for Stage 1, then explore resource effectiveness groups +- **Cross-platform strategy**: Focus on the four common categories while adapting tools and counter interpretations to each architecture + +Continue to the next step to see practical examples comparing both methodologies. 
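If you want to check the Stage 1 arithmetic yourself before moving on, the sketch below applies the Neoverse formulas from the previous section to made-up counter values. `topdown-tool` performs this accounting for you, so treat this only as a reference for how the percentages are derived:

```python
# Illustrative Arm Neoverse Stage 1 calculation with hypothetical PMU counter values
cpu_cycles = 2.0e9
stall_slot_frontend = 1.6e9
stall_slot_backend = 8.0e9
stall_slot = stall_slot_frontend + stall_slot_backend
op_retired = 5.0e9
op_spec = 5.5e9
br_mis_pred = 0.02e9

slots = cpu_cycles * 8        # 8 rename slots per cycle on Neoverse V1

backend_bound = 100 * (stall_slot_backend / slots)
frontend_bound = 100 * ((stall_slot_frontend / slots) - (br_mis_pred / (4 * cpu_cycles)))
bad_speculation = 100 * ((1 - op_retired / op_spec) * (1 - stall_slot / slots)
                         + br_mis_pred / (4 * cpu_cycles))
retiring = 100 * (op_retired / op_spec) * (1 - stall_slot / slots)

for name, value in [("Frontend bound", frontend_bound), ("Backend bound", backend_bound),
                    ("Bad speculation", bad_speculation), ("Retiring", retiring)]:
    print(f"{name:16} {value:5.1f}%")
```

As with the Intel sketch, the four percentages should sum to roughly 100 for a consistent set of counter values.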
\ No newline at end of file diff --git a/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md b/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md index 1050cceb5b..bf7f23ec71 100644 --- a/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md +++ b/content/learning-paths/cross-platform/topdown-compare/2-code-examples.md @@ -1,16 +1,16 @@ --- -title: Performance analysis code example -weight: 4 +title: "Measure cross-platform performance with topdown-tool and Perf PMU counters" +weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Example code +## Cross-platform performance analysis example -To compare top-down on Arm and x86 you can run a small example to gain some practical experience. +To compare x86 and Arm Neoverse top-down methodologies, you can run a backend-bound benchmark that demonstrates PMU counter differences between architectures. -You can prepare the application and test it on both x86 and Arm Linux systems. You will need a C compiler installed, [GCC](/install-guides/gcc/native/) or Clang, and [Perf](/install-guides/perf/) installed on each system. Refer to the package manager for your Linux distribution for installation information. +You can prepare the application and test it on both x86 and Arm Neoverse Linux systems. You will need a C compiler installed, [GCC](/install-guides/gcc/native/) or Clang, and [Perf](/install-guides/perf/) installed on each system. For Arm systems, you'll also need [topdown-tool](/install-guides/topdown-tool/). Refer to the package manager for your Linux distribution for installation information. Use a text editor to copy the code below to a file named `test.c` @@ -51,9 +51,7 @@ int main(int argc, char *argv[]) { } ``` -This program takes a single command-line argument specifying the number of iterations to run. It performs that many sequential floating-point divisions in a loop, using a volatile variable to prevent compiler optimization, and prints the final result. - -It's a contrived example used to create a dependency chain of high-latency operations (divisions), simulating a CPU-bound workload where each iteration must wait for the previous one to finish. +This program demonstrates a backend-bound workload that will show high `STALL_SLOT_BACKEND` on Arm Neoverse and high `Backend_Bound` percentage on x86. It takes a single command-line argument specifying the number of iterations to run. The sequential floating-point divisions create a dependency chain of high-latency operations, simulating a core-bound workload where each iteration must wait for the previous division to complete. Build the application using GCC: @@ -76,11 +74,11 @@ Performing 1000000000 dependent floating-point divisions... Done. Final result: 0.000056 ``` -## Collect x86 top-down level 1 metrics +## Collect x86 top-down Level 1 metrics with Perf -Linux Perf computes top-down level 1 breakdown as described in the previous section for Retiring, Bad Speculation, Frontend Bound, and Backend Bound. +Linux Perf computes 4-level hierarchical top-down breakdown using PMU counters like `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` for the four categories: Retiring, Bad Speculation, Frontend Bound, and Backend Bound. -Use `perf stat` to on the pinned core to collect the metrics. +Use `perf stat` on the pinned core to collect Level 1 metrics: ```console taskset -c 1 perf stat -C 1 --topdown ./test 1000000000 @@ -131,18 +129,21 @@ Done. 
Final result: 0.000056 6.029283206 seconds time elapsed ``` -Again, showing `Backend_Bound` value very high (0.96). +Again, showing `Backend_Bound` value very high (0.96). Notice the x86-specific PMU counters: +- `uops_issued.any` and `uops_retired.retire_slots` for micro-operation accounting +- `idq_uops_not_delivered.core` for frontend delivery failures +- `cpu_clk_unhalted.thread` for cycle normalization -If you want to learn more, you can continue with the level 2 and level 3 analysis. +If you want to learn more, you can continue with the Level 2 and Level 3 hierarchical analysis. -## Use the Arm top-down methodology +## Use the Arm Neoverse 2-stage top-down methodology -Make sure you install the Arm top-down tool. +Arm's approach uses a 2-stage methodology with PMU counters like `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` for Stage 1 analysis, followed by resource effectiveness groups in Stage 2. -Use the [Telemetry Solution install guide](/install-guides/topdown-tool/) for information about installing `topdown-tool`. +Make sure you install the Arm topdown-tool using the [Telemetry Solution install guide](/install-guides/topdown-tool/). -Collect instructions per cycle (IPC): +Collect Stage 2 general metrics including Instructions Per Cycle (IPC): ```console taskset -c 1 topdown-tool -m General ./test 1000000000 @@ -159,7 +160,7 @@ Stage 2 (uarch metrics) Instructions Per Cycle 0.355 per cycle ``` -Connect the stage 1 metrics: +Collect the Stage 1 topdown metrics using Arm's cycle accounting: ```console taskset -c 1 topdown-tool -m Cycle_Accounting ./test 1000000000 @@ -177,7 +178,7 @@ Frontend Stalled Cycles 0.04% cycles Backend Stalled Cycles. 88.15% cycles ``` -This confirms the example has high backend stalls as on x86. +This confirms the example has high backend stalls equivalent to x86's Backend_Bound category. Notice how Arm's Stage 1 uses percentage of cycles rather than Intel's slot-based accounting. You can continue to use the `topdown-tool` for additional microarchitecture exploration. @@ -199,7 +200,7 @@ L1D Cache MPKI............... 0.023 misses per 1,000 instructions L1D Cache Miss Ratio......... 0.000 per cache access ``` -For L1 instruction cache: +For L1 instruction cache effectiveness: ```console taskset -c 1 topdown-tool -m L1D_Cache_Effectiveness ./test 1000000000 @@ -260,9 +261,11 @@ Crypto Operations Percentage........ 0.00% operations ``` -## Summary +## Cross-architecture performance analysis summary + +Both Arm Neoverse and modern x86 cores expose hardware PMU events that enable equivalent top-down analysis, despite different counter names and calculation methods. Intel x86 processors use a four-level hierarchical methodology based on slot-based pipeline accounting, relying on PMU counters such as `UOPS_RETIRED.RETIRE_SLOTS`, `IDQ_UOPS_NOT_DELIVERED.CORE`, and `CPU_CLK_UNHALTED.THREAD` to break down performance into retiring, bad speculation, frontend bound, and backend bound categories. Linux Perf serves as the standard collection tool, using commands like `perf stat --topdown` and the `-M topdownl1` option for detailed breakdowns. -Both Arm Neoverse and modern x86 cores expose hardware events that Perf aggregates into the same top-down categories. Names of the PMU counters differ, but the level 1 categories are the same. 
+Arm Neoverse platforms implement a complementary two-stage methodology where Stage 1 focuses on topdown categories using counters such as `STALL_SLOT_BACKEND`, `STALL_SLOT_FRONTEND`, `OP_RETIRED`, and `OP_SPEC` to analyze pipeline stalls and instruction retirement. Stage 2 evaluates resource effectiveness, including cache and operation mix metrics through `topdown-tool`, which accepts the desired metric group via the `-m` argument. -If you are working on both architectures you can use the same framework with minor differences between Intel's hierarchical structure and Arm's two-stage resource groups to systematically identify and resolve performance bottlenecks. +Both architectures identify the same performance bottleneck categories, enabling similar optimization strategies across Intel and Arm platforms while accounting for methodological differences in measurement depth and analysis approach. diff --git a/content/learning-paths/cross-platform/topdown-compare/_index.md b/content/learning-paths/cross-platform/topdown-compare/_index.md index 275149ff08..d358ec630c 100644 --- a/content/learning-paths/cross-platform/topdown-compare/_index.md +++ b/content/learning-paths/cross-platform/topdown-compare/_index.md @@ -1,21 +1,23 @@ --- -title: "Compare Arm and x86 Top-Down Performance Analysis" - -minutes_to_complete: 30 +title: Compare Arm Neoverse and Intel x86 top-down performance analysis with PMU counters draft: true cascade: draft: true -who_is_this_for: This is an advanced topic for software developers who want to understand the similarities and differences between Arm and x86 top-down performance analysis. +minutes_to_complete: 30 + +who_is_this_for: This is an advanced topic for software developers and performance engineers who want to understand the similarities and differences between Arm Neoverse and Intel x86 top-down performance analysis using PMU counters, Linux Perf, and the topdown-tool. learning_objectives: - - Describe the similarities and differences between top-down performance analysis on x86 and Arm Linux systems. - - Run applications on both architectures and understand how performance analysis is done on each system. + - Compare Intel x86 4-level hierarchical top-down methodology with Arm Neoverse 2-stage approach using PMU counters + - Execute performance analysis using Linux Perf on x86 and topdown-tool on Arm systems + - Analyze Backend Bound, Frontend Bound, Bad Speculation, and Retiring categories across both architectures prerequisites: - - Familiarity with performance analysis on Linux systems using Perf. - - Arm and x86 Linux systems to try code examples. + - Familiarity with performance analysis on Linux systems using Perf and PMU counters + - Access to Arm Neoverse and Intel x86 Linux systems for hands-on examples + - Basic understanding of CPU pipeline concepts and performance bottlenecks author: - Jason Andrews @@ -30,6 +32,8 @@ operatingsystems: tools_software_languages: - GCC - Clang + - Perf + - topdown-tool shared_path: true shared_between: @@ -47,7 +51,7 @@ further_reading: type: documentation - resource: title: How to use the Arm Performance Monitoring Unit and System Counter - link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/arm_pmu/). 
+ link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/arm_pmu/ type: website diff --git a/content/learning-paths/embedded-and-microcontrollers/raspberry-pi-smart-home/4-smart-home-assistant.md b/content/learning-paths/embedded-and-microcontrollers/raspberry-pi-smart-home/4-smart-home-assistant.md index 6b5b888bcc..c3c4d58508 100644 --- a/content/learning-paths/embedded-and-microcontrollers/raspberry-pi-smart-home/4-smart-home-assistant.md +++ b/content/learning-paths/embedded-and-microcontrollers/raspberry-pi-smart-home/4-smart-home-assistant.md @@ -35,7 +35,7 @@ In the previous section, you configured a LED on GPIO pin 17. The smart home ass The code uses gpiozero with lgpio backend for Raspberry Pi 5 compatibility. You can use compatible output devices such as LEDs, relays, or small loads connected to these GPIO pins to represent actual smart home devices. All pin assignments are optimized for the Raspberry Pi 5's GPIO layout. {{% /notice %}} -![[Raspberry Pi 5 connected to a breadboard with LEDs, push button, and sensor module alt-text#center](hardware.jpeg "Setup that includes a blue LED (mapped to Living Room Light on GPIO 17), a red LED, push button, and a sensor module.") +![Raspberry Pi 5 connected to a breadboard with LEDs, push button, and sensor module alt-text#center](hardware.jpeg "Setup that includes a blue LED (mapped to Living Room Light on GPIO 17), a red LED, push button, and a sensor module.") This setup illustrates a simulated smart home with controllable devices. diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md index 1fde02ba2f..72750ad6a0 100644 --- a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md +++ b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/env-setup-1.md @@ -15,7 +15,10 @@ You will train a lightweight CNN to classify images of the letters R, P, and S a ### What is a Convolutional Neural Network (CNN)? A Convolutional Neural Network (CNN) is a type of deep neural network primarily used for analyzing visual imagery. Unlike traditional neural networks, CNNs are designed to process pixel data by using a mathematical operation called convolution. This allows them to automatically and adaptively learn spatial hierarchies of features from input images, from low-level features like edges and textures to high-level features like shapes and objects. -A convolutional neural network (CNN) is a deep neural network designed to analyze visual data using the *convolution* operation. CNNs learn spatial hierarchies of features - from edges and textures to shapes and objects - directly from pixels. +![CNN architecture](typical_cnn.png) + +Typical CNN architecture by Aphex34, licensed under +[CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). 
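To make the pattern in the figure concrete, a classifier of this kind can be written in a few lines of PyTorch. The sketch below only illustrates the convolution, pooling, and fully connected stages; it is not the exact model you build later in this Learning Path, and the 28x28 grayscale input size is just an assumption for the example:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: two convolution/pooling stages followed by a classifier head."""
    def __init__(self, num_classes=3):  # for example: rock, paper, scissors
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learn low-level edges and textures
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # learn higher-level shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = TinyCNN()
print(model(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 3])
```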
Common CNN applications include: diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/image.png b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/image.png deleted file mode 100644 index b548f79463..0000000000 Binary files a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/image.png and /dev/null differ diff --git a/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/typical_cnn.png b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/typical_cnn.png new file mode 100644 index 0000000000..b856034f0b Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/training-inference-pytorch/typical_cnn.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/godot_packages/add-markers.md b/content/learning-paths/mobile-graphics-and-gaming/godot_packages/add-markers.md index 176283b81b..b863ad5520 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/godot_packages/add-markers.md +++ b/content/learning-paths/mobile-graphics-and-gaming/godot_packages/add-markers.md @@ -10,7 +10,7 @@ All annotation features are provided through the `PerformanceStudio` class. To b ```gdscript var performance_studio = PerformanceStudio.new() - +``` ## Add single markers to highlight key game events diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index 87a21aecdc..d40cbc760a 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -8,7 +8,7 @@ key_ip: maintopic: true operatingsystems_filter: - Android: 3 -- Linux: 179 +- Linux: 180 - macOS: 13 - Windows: 14 pinned_modules: @@ -24,7 +24,7 @@ subjects_filter: - Libraries: 9 - ML: 32 - Performance and Architecture: 72 -- Storage: 1 +- Storage: 2 - Web: 12 subtitle: Optimize cloud native apps on Arm for performance and cost title: Servers and Cloud Computing @@ -115,7 +115,9 @@ tools_software_languages_filter: - Java: 4 - JAX: 1 - JMH: 1 -- Kafka: 1 +- Kafka: 2 +- kafka-consumer-perf-test.sh: 1 +- kafka-producer-perf-test.sh: 1 - KEDA: 1 - Kedify: 1 - Keras: 1 @@ -153,6 +155,7 @@ tools_software_languages_filter: - Orchard Core: 1 - PAPI: 1 - perf: 6 +- Perf: 1 - PostgreSQL: 4 - Profiling: 1 - Python: 32 @@ -180,6 +183,7 @@ tools_software_languages_filter: - TensorFlow: 2 - Terraform: 11 - ThirdAI: 1 +- topdown-tool: 1 - Trusted Firmware: 1 - Trustee: 1 - TSan: 1 @@ -202,6 +206,6 @@ weight: 1 cloud_service_providers_filter: - AWS: 17 - Google Cloud: 18 -- Microsoft Azure: 17 +- Microsoft Azure: 18 - Oracle: 2 --- diff --git a/content/learning-paths/servers-and-cloud-computing/cca-device-attach/3.bounce_buffers.md b/content/learning-paths/servers-and-cloud-computing/cca-device-attach/3.bounce_buffers.md index 6884b98129..45cf4e6404 100644 --- a/content/learning-paths/servers-and-cloud-computing/cca-device-attach/3.bounce_buffers.md +++ b/content/learning-paths/servers-and-cloud-computing/cca-device-attach/3.bounce_buffers.md @@ -64,7 +64,7 @@ A bounce buffer preserves the confidentiality of other Realm data because only t ## Next steps -In the next section, you'll test this by tracing SWIOTLB activity in [Exercise: observe bounce buffers in a Realm](./lab-observe-bounce-buffers.md). +In the next section, you'll test this by tracing SWIOTLB activity. 
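If you want to confirm ahead of the exercise that your kernel exposes a SWIOTLB tracepoint, one quick check (assuming tracefs is mounted in the usual location) is:

```bash
sudo grep swiotlb /sys/kernel/tracing/available_events
```

On recent kernels you should see an event such as `swiotlb:swiotlb_bounced`, which is the activity you trace in the next section.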
diff --git a/content/learning-paths/servers-and-cloud-computing/dotnet-migration/4-dotnet-versions.md b/content/learning-paths/servers-and-cloud-computing/dotnet-migration/4-dotnet-versions.md index b188296812..79254fa8e4 100644 --- a/content/learning-paths/servers-and-cloud-computing/dotnet-migration/4-dotnet-versions.md +++ b/content/learning-paths/servers-and-cloud-computing/dotnet-migration/4-dotnet-versions.md @@ -15,8 +15,8 @@ Understanding which versions perform best and the features they offer can help y {{% notice Support status summary %}} - .NET 8 – Current LTS (support until Nov 2026) -- .NET 9 – STS (preview; GA Q4 2025) -- .NET 10 – Next LTS (preview; expected 2025 Q4–Q1 2026) +- .NET 9 – STS (support until Nov 2026) +- .NET 10 – Next LTS (preview; expected GA Nov 2025) - .NET 3.1, 5, 6, 7 – End of life {{% /notice %}} @@ -69,7 +69,7 @@ Important Arm-related improvements include: - Smaller base container images (`mcr.microsoft.com/dotnet/aspnet:8.0` and `…/runtime:8.0`) thanks to a redesigned layering strategy, particularly beneficial on Arm where network bandwidth is often at a premium. - Garbage-collector refinements that reduce pause times on highly-threaded, many-core servers. -## .NET 9 +## .NET 9 (current STS - support until November 2026) .NET 9 is still in preview, so features may change, but public builds already show promising Arm-centric updates: diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md index bc5f687039..a841bab945 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/_index.md @@ -1,5 +1,5 @@ --- -title: Learn about the impact of NIC IRQs and patterns on cloud +title: Learn about the impact of network interrupts on cloud workloads draft: true cascade: @@ -7,16 +7,15 @@ cascade: minutes_to_complete: 20 -who_is_this_for: This is anyone interested in understanding how IRQ patterns can enhance networking workload performance on cloud. - +who_is_this_for: This is a specialized topic for developers and performance engineers who are interested in understanding how network interrupt patterns can impact performance on cloud servers. learning_objectives: - - Analyze the current IRQ layout on the machine. - - Test different options and patterns to improve performance. + - Analyze the current interrupt request (IRQ) layout on an Arm Linux system + - Experiment with different interrupt options and patterns to improve performance prerequisites: - - An Arm computer running Linux installed. - - Some familiarity with running Linux command line commands. 
+ - An Arm computer running Linux + - Some familiarity with the Linux command line author: Kiel Friedt @@ -24,7 +23,8 @@ author: Kiel Friedt skilllevels: Introductory subjects: Performance and Architecture armips: - - AArch64 + - Neoverse + - Cortex-A tools_software_languages: operatingsystems: diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md index f9a798b797..91a635293c 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/checking.md @@ -1,14 +1,29 @@ --- -title: checking IRQs +title: Understand and Analyze network IRQ configuration weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -First you should run the following command to identify all IRQs on the system. Identify the NIC IRQs and adjust the system by experimenting and seeing how performance improves. +## Why IRQ management matters for performance -``` +In modern cloud environments, network performance is critical to overall system efficiency. Network interface cards (NICs) generate interrupt requests (IRQs) to notify the CPU when data packets arrive or need to be sent. These interrupts temporarily pause normal processing, allowing the system to handle network traffic. + +By default, Linux distributes these network interrupts across available CPU cores. However, this distribution is not always optimal for performance: + +- High interrupt rates: In busy servers, network cards can generate thousands of interrupts per second +- CPU cache locality: Processing related network operations on the same CPU core improves cache efficiency +- Resource contention: When network IRQs compete with application workloads for the same CPU resources, both can suffer +- Power efficiency: IRQ management can help reduce unnecessary CPU wake-ups, improving energy efficiency + +Understanding and optimizing IRQ assignment allows you to balance network processing loads, reduce latency, and maximize throughput for your specific workloads. + +## Identifying IRQs on your system + +To get started, run this command to display all IRQs on your system and their CPU assignments: + +```bash grep '' /proc/irq/*/smp_affinity_list | while IFS=: read path cpus; do irq=$(basename $(dirname $path)) device=$(grep -E "^ *$irq:" /proc/interrupts | awk '{print $NF}') @@ -16,10 +31,9 @@ grep '' /proc/irq/*/smp_affinity_list | while IFS=: read path cpus; do done ``` +The output is very long and looks similar to: -{{% notice Note %}} -output should look similar to this: -``` +```output IRQ 104 -> CPUs 12 -> Device ens34-Tx-Rx-5 IRQ 105 -> CPUs 5 -> Device ens34-Tx-Rx-6 IRQ 106 -> CPUs 10 -> Device ens34-Tx-Rx-7 @@ -33,12 +47,43 @@ IRQ 21 -> CPUs 0-15 -> Device ACPI:Ged ... IRQ 26 -> CPUs 0-15 -> Device ACPI:Ged ``` -{{% /notice %}} -Now, you may notice that the NIC IRQs are assigned to a duplicate CPU by default. +## How to identify network IRQs + +Network-related IRQs can be identified by looking at the "Device" column in the output. -like this example: +You can identify network interfaces using the command: + +```bash +ip link show ``` + +Here are some common patterns to look for: + +Common interface naming patterns include `eth0` for traditional ethernet, `enP3p3s0f0` and `ens5-Tx-Rx-0` for the Linux predictable naming scheme, or `wlan0` for wireless. 
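Once you know the interface name, you can filter the IRQ listing down to that device. This is a sketch that assumes an interface named `ens34`; substitute the name reported by `ip link show` on your system:

```bash
# Print the IRQ number and queue name for each interrupt owned by ens34 (hypothetical name)
grep 'ens34' /proc/interrupts | awk '{print $1, $NF}'
```

The first column is the IRQ number and the last column is the queue name.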
+ +The predictable naming scheme breaks down into: + +- en = ethernet +- P3 = PCI domain 3 +- p3 = PCI bus 3 +- s0 = PCI slot 0 +- f0 = function 0 + +This naming convention helps ensure network interfaces have consistent names across reboots by encoding their physical +location in the system. + +## Improve performance + +Once you've identified the network IRQs, you can adjust their CPU assignments to try to improve performance. + +Identify the NIC (Network Interface Card) IRQs and adjust the system by experimenting and seeing if performance improves. + +You may notice that some NIC IRQs are assigned to the same CPU cores by default, creating duplicate assignments. + +For example: + +```output IRQ 100 -> CPUs 2 -> Device ens34-Tx-Rx-1 IRQ 101 -> CPUs 12 -> Device ens34-Tx-Rx-2 IRQ 102 -> CPUs 14 -> Device ens34-Tx-Rx-3 @@ -47,26 +92,30 @@ IRQ 104 -> CPUs 12 -> Device ens34-Tx-Rx-5 IRQ 105 -> CPUs 5 -> Device ens34-Tx-Rx-6 IRQ 106 -> CPUs 10 -> Device ens34-Tx-Rx-7 ``` -This can potential hurt performance. Suggestions and patterns to experiment with will be on the next step. -### reset +## Understanding IRQ performance impact -If performance reduces, you can return the IRQs back to default using the following commands. +When network IRQs are assigned to the same CPU cores (as shown in the example above where IRQ 101 and 104 both use CPU 12), this can potentially hurt performance as multiple interrupts compete for the same CPU core's attention, while other cores remain underutilized. -``` +By optimizing IRQ distribution, you can achieve more balanced processing and improved throughput. This optimization is especially important for high-traffic servers where network performance is critical. + +Suggested experiments are covered in the next section. + +### How can I reset my IRQs if I make performance worse? + +If your experiments reduce performance, you can return the IRQs back to default using the following commands: + +```bash sudo systemctl unmask irqbalance sudo systemctl enable --now irqbalance ``` -or you can run the following +If needed, install `irqbalance` on your system. For Debian based systems run: -``` -DEF=$(cat /proc/irq/default_smp_affinity) -for f in /proc/irq/*/smp_affinity; do - echo "$DEF" | sudo tee "$f" >/dev/null || true -done +```bash +sudo apt install irqbalance ``` ### Saving these changes -Any changes you make to IRQs will be reset at reboot. You will need to change your systems settings to make your changes permanent. +Any changes you make to IRQs will be reset at reboot. You will need to change your system's settings to make your changes permanent. diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md index c25a4ceb1c..d73a807ef9 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/conclusion.md @@ -1,17 +1,51 @@ --- -title: conclusion +title: Conclusion and recommendations weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -While a single pattern does not work for all workloads. Our testing found that under heavy network workloads, different patterns performed better based on sizing. +## Optimal IRQ Management Strategies -### upto and under 16 vCPUs -For best performance, reduce NIC IRQs to either one or two cores. Otherwise random or default performed second best. 
+Testing across multiple cloud platforms reveals that IRQ management effectiveness varies significantly based on system size and workload characteristics. No single pattern works optimally for all scenarios, but clear patterns emerged during performance testing under heavy network loads. -*If the number of NIC IRQS are more then the number of vCPUs, concentrating them over less cores improved performance significantly. +## Recommendations by system size -### over 16 vCPUs -No pattern showed significant improvement over default as long as all NIC IRQs were not on duplicate cores. +### Systems with 16 vCPUs or less + +For smaller systems with 16 or less vCPUs, concentrated IRQ assignment may provide measurable performance improvements. + +- Assign all network IRQs to just one or two CPU cores +- This approach showed the most significant performance gains +- Most effective when the number of NIC IRQs exceeds the number of vCPUs +- Use the `smp_affinity` range assignment pattern from the previous section with a very limited core range, for example `0-1` + +Performance improves significantly when network IRQs are concentrated rather than dispersed across all available cores on smaller systems. + +### Systems with more than 16 vCPUs + +For larger systems with more than 16 vCPUs, the findings are different: + +- Default IRQ distribution generally performs well +- The primary concern is avoiding duplicate core assignments for network IRQs +- Use the scripts from the previous section to check and correct any overlapping IRQ assignments +- The paired core pattern can help ensure optimal distribution on these larger systems + +On larger systems, the overhead of interrupt handling is proportionally smaller compared to the available processing power. The main performance bottleneck occurs when multiple high-frequency network interrupts compete for the same core. + +## Implementation Considerations + +When implementing these IRQ management strategies, there are some important points to keep in mind. + +Pay attention to the workload type. CPU-bound applications may benefit from different IRQ patterns than I/O-bound applications. + +Always benchmark your specific workload with different IRQ patterns. + +Monitor IRQ counts in real-time using `watch -n1 'grep . /proc/interrupts'` to observe IRQ distribution in real-time. + +Also consider NUMA effects on multi-socket systems. Keep IRQs on cores close to the PCIe devices generating them to minimize cross-node memory access. + +Make sure to set up IRQ affinity settings in `/etc/rc.local` or a systemd service file to ensure they persist across reboots. + +Remember that as workloads and hardware evolve, revisiting and adjusting IRQ management strategies may be necessary to maintain optimal performance. diff --git a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md index 285819a065..46c788226e 100644 --- a/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md +++ b/content/learning-paths/servers-and-cloud-computing/irq-tuning-guide/patterns.md @@ -1,39 +1,58 @@ --- -title: patterns +title: IRQ management patterns for performance optimization weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -The following patterns were ran on multiple cloud and on a variety of sizes. A recommended IRQ pattern will be suggested at the end. Based on your workload, a different pattern may result in higher performance. 
+## Optimizing network performance with IRQ management + +Different IRQ management patterns can significantly impact network performance across multiple cloud platforms and virtual machine sizes. This Learning Path presents various IRQ distribution strategies, along with scripts to implement them on your systems. + +Network interrupt requests (IRQs) can be distributed across CPU cores in various ways, each with potential benefits depending on your workload characteristics and system configuration. By strategically assigning network IRQs to specific cores, you can improve cache locality, reduce contention, and potentially boost overall system performance. + +The following patterns have been tested on various systems and can be implemented using the provided scripts. An optimal pattern is suggested at the conclusion of this Learning Path, but your specific workload may benefit from a different approach. ### Patterns + 1. Default: IRQ pattern provided at boot. 2. Random: All IRQs are assigned a core and do not overlap with network IRQs. -3. Housekeeping: All IRQs outside of network IRQs are assign to specific core(s). -4. NIC IRQs are set to single or multiple ranges of cores and including pairs. EX. 1, 1-2, 0-3, 0-7, [0-1, 2-3..], etc. - +3. Housekeeping: All IRQs outside of network IRQs are assigned to specific core(s). +4. NIC IRQs are assigned to single or multiple ranges of cores, including pairs. ### Scripts to change IRQ +The scripts below demonstrate how to implement different IRQ management patterns on your system. Each script targets a specific distribution strategy: + +Before running these scripts, identify your network interface name using `ip link show` and determine your system's CPU topology with `lscpu`. Always test these changes in a non-production environment first, as improper IRQ assignment can impact system stability. + To change the NIC IRQs or IRQs in general you can use the following scripts. -Housekeeping pattern example, you will need to add more to account for other IRQs on your system +### Housekeeping -``` -HOUSEKEEP=#core range here +The housekeeping pattern isolates non-network IRQs to dedicated cores. + +You need to add more to account for other IRQs on your system. + +```bash +HOUSEKEEP=#core range here (example: "0,3") -# ACPI:Ged for irq in $(awk '/ACPI:Ged/ {sub(":","",$1); print $1}' /proc/interrupts); do echo $HOUSEKEEP | sudo tee /proc/irq/$irq/smp_affinity_list >/dev/null done ``` -This is for pairs on a 16 vCPU machine, you will need the interface name. +### Paired core -``` -IFACE=#interface name +The paired core assignment pattern distributes network IRQs across CPU core pairs for better cache coherency. + +This is for pairs on a 16 vCPU machine. + +You need to add the interface name. + +```bash +IFACE=#interface name (example: "ens5") PAIRS=("0,1" "2,3" "4,5" "6,7" "8,9" "10,11" "12,13" "14,15") @@ -49,12 +68,22 @@ for irq in "${irqs[@]}"; do done ``` -This will assign a specific core(s) to NIC IRQs only +### Range assignment -``` -IFACE=#interface name +The range assignment pattern assigns network IRQs to a specific range of cores. + +This will assign a specific core(s) to NIC IRQs only. -for irq in $(awk '/$IFACE/ {sub(":","",$1); print $1}' /proc/interrupts); do +You need to add the interface name. 
+ +```bash +IFACE=#interface name (example: "ens5") + +for irq in $(awk '/'$IFACE'/ {sub(":","",$1); print $1}' /proc/interrupts); do echo 0-15 | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null done -``` \ No newline at end of file +``` + +Each pattern offers different performance characteristics depending on your workload. The housekeeping pattern reduces system noise, paired cores optimize cache usage, and range assignment provides dedicated network processing capacity. Test these patterns in your environment to determine which provides the best performance for your specific use case. + +Continue to the next section for additional guidance. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/_index.md new file mode 100644 index 0000000000..f9b2845871 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/_index.md @@ -0,0 +1,60 @@ +--- +title: Deploy Kafka on the Microsoft Azure Cobalt 100 processors + +draft: true +cascade: + draft: true + +minutes_to_complete: 30 + +who_is_this_for: This Learning Path is designed for software developers looking to migrate their Kafka workloads from x86_64 to Arm-based platforms, specifically on the Microsoft Azure Cobalt 100 processors. + +learning_objectives: + - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. + - Deploy Kafka on the Ubuntu virtual machine. + - Perform Kafka baseline testing and benchmarking on both x86_64 and Arm64 virtual machines. + +prerequisites: + - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6). + - Basic understanding of Linux command line. + - Familiarity with the [Apache Kafka architecture](https://kafka.apache.org/) and deployment practices on Arm64 platforms. + +author: Jason Andrews + +### Tags +skilllevels: Advanced +subjects: Storage +cloud_service_providers: Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - Kafka + - kafka-producer-perf-test.sh + - kafka-consumer-perf-test.sh + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Kafka Manual + link: https://kafka.apache.org/documentation/ + type: documentation + - resource: + title: Kafka Performance Tool + link: https://codemia.io/knowledge-hub/path/use_kafka-producer-perf-testsh_how_to_set_producer_config_at_kafka_210-0820 + type: documentation + - resource: + title: Kafka on Azure + link: https://learn.microsoft.com/en-us/samples/azure/azure-quickstart-templates/kafka-ubuntu-multidisks/ + type: documentation + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/background.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/background.md new file mode 100644 index 0000000000..48990a4d0a --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/background.md @@ -0,0 +1,20 @@ +--- +title: "Overview" + +weight: 2 + +layout: "learningpathall" +--- + +## Cobalt 100 Arm-based processor + +Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance. + +To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). + +## Apache Kafka +Apache Kafka is a high-performance, open-source distributed event streaming platform designed for building real-time data pipelines and streaming applications. + +It allows you to publish, subscribe to, store, and process streams of records in a fault-tolerant and scalable manner. Kafka stores data in topics, which are partitioned and replicated across a cluster to ensure durability and high availability. + +Kafka is widely used for messaging, log aggregation, event sourcing, real-time analytics, and integrating large-scale data systems. Learn more from the [Apache Kafka official website](https://kafka.apache.org/) and its [official documentation](https://kafka.apache.org/documentation). diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/baseline.md new file mode 100644 index 0000000000..46453417d3 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/baseline.md @@ -0,0 +1,104 @@ +--- +title: Baseline Testing +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run a Baseline test with Kafka + +After installing Kafka on your Arm64 virtual machine, you can perform a simple baseline test to validate that Kafka runs correctly and produces the expected output. 
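Optionally, before starting the test, confirm that you are running on an Arm64 virtual machine. This is a standard Linux check and is not specific to Kafka:

```console
uname -m
```

The expected output is `aarch64`.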
+ +Kafka 4.1.0 uses **KRaft**, which integrates the control and data planes, eliminating the need for a separate ZooKeeper instance. + +We need 4 terminals to complete this test. The first will start the Kafka server, the second will create a topic, and the final two will send and receive messages, respectively. + +### Initial Setup: Configure & Format KRaft +**KRaft** is Kafka's new metadata protocol that integrates the responsibilities of ZooKeeper directly into Kafka, simplifying deployment and improving scalability by making the brokers self-managing. + +First, you must configure your `server.properties` file for KRaft and format the storage directory. These steps are done only once. + +**1. Edit the Configuration File**: Open your `server.properties` file. + +```console +nano /opt/kafka/config/server.properties +``` + +**2. Add/Modify KRaft Properties:** Ensure the following lines are present and correctly configured for a single-node setup. + +This configuration file sets up a single Kafka server to act as both a **controller** (managing cluster metadata) and a broker (handling data), running in **KRaft** mode. It defines the node's unique ID and specifies the local host as the sole participant in the **controller** quorum. + +```java +process.roles=controller,broker +node.id=1 +controller.quorum.voters=1@localhost:9093 +listeners=PLAINTEXT://:9092,CONTROLLER://:9093 +advertised.listeners=PLAINTEXT://localhost:9092 +log.dirs=/tmp/kraft-combined-logs +``` +**3. Format the Storage Directory:** Use the `kafka-storage.sh` tool to format the metadata directory. + +```console +bin/kafka-storage.sh format -t $(bin/kafka-storage.sh random-uuid) -c config/server.properties +``` +You should see an output similar to: + +```output +Formatting metadata directory /tmp/kraft-combined-logs with metadata.version 4.1-IV1. +``` + +Now, Perform the Baseline Test + +### Terminal 1 – Start Kafka Broker +This command starts the Kafka broker (the main server that sends and receives messages) in KRaft mode. Keep this terminal open. + +```console +cd /opt/kafka +bin/kafka-server-start.sh config/server.properties +``` +### Terminal 2 – Create a Topic +This command creates a new Kafka topic named `test-topic-kafka` (like a channel where messages will be stored and shared) with 1 partition and 1 copy (replica). + +```console +cd /opt/kafka +bin/kafka-topics.sh --create --topic test-topic-kafka --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 +``` +You should see output similar to: + +```output +Created topic test-topic-kafka. +``` + +- **Verify topic** + +```console +bin/kafka-topics.sh --list --bootstrap-server localhost:9092 +``` +You should see output similar to: + +```output +__consumer_offsets +test-topic-kafka +``` + +### Terminal 3 – Console Producer (Write Message) +This command starts the **Kafka Producer**, which lets you type and send messages into the `test-topic-kafka` topic. For example, when you type `hello from azure vm`, this message will be delivered to any Kafka consumer subscribed to that topic. + +```console +cd /opt/kafka +bin/kafka-console-producer.sh --topic test-topic-kafka --bootstrap-server localhost:9092 +``` +You should see an empty prompt where you can start typing. Type `hello from azure arm vm` and press **Enter**. + +### Terminal 4 – Console Consumer (Read Message) +This command starts the **Kafka Consumer**, which listens to the `test-topic-kafka` topic and displays all messages from the beginning. 
+ +```console +cd /opt/kafka +bin/kafka-console-consumer.sh --topic test-topic-kafka --from-beginning --bootstrap-server localhost:9092 +``` + +You should see your message `hello from azure arm vm` displayed in this terminal, confirming that the producer's message was successfully received. + +Now you can proceed to benchmarking Kafka’s performance on the Azure Cobalt 100 Arm virtual machine. diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/benchmarking.md new file mode 100644 index 0000000000..051663dc9a --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/benchmarking.md @@ -0,0 +1,118 @@ +--- +title: Benchmarking with Official Kafka Tools +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Benchmark Kafka on Azure Cobalt 100 Arm-based instances and x86_64 instances + +Kafka’s official performance tools (**kafka-producer-perf-test.sh** and **kafka-consumer-perf-test.sh**) let you generate test workloads, measure message throughput, and record end-to-end latency. + +## Steps for Kafka Benchmarking + +Before starting the benchmark, ensure that the **Kafka broker** are already running in separate terminals. + +Now, open two new terminals—one for the **producer benchmark** and another for the **consumer benchmark**. + +### Terminal A - Producer Benchmark + +The producer benchmark measures how fast Kafka can send messages, reporting throughput and latency percentiles. + +```console +cd /opt/kafka +bin/kafka-producer-perf-test.sh \ + --topic test-topic-kafka \ + --num-records 1000000 \ + --record-size 100 \ + --throughput -1 \ + --producer-props bootstrap.servers=localhost:9092 +``` +You should see output similar to: + +```output +1000000 records sent, 252589.0 records/sec (24.09 MB/sec), 850.85 ms avg latency, 1219.00 ms max latency, 851 ms 50th, 1184 ms 95th, 1210 ms 99th, 1218 ms 99.9th. +``` +### Terminal B - Consumer benchmark + +The consumer benchmark measures how fast Kafka can read messages from the topic, reporting throughput and total messages consumed. + +```console +cd /opt/kafka +bin/kafka-consumer-perf-test.sh \ + --topic test-topic-kafka \ + --bootstrap-server localhost:9092 \ + --messages 1000000 \ + --timeout 30000 +``` +You should see output similar to: + +```output +start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec +2025-09-03 06:07:13:616, 2025-09-03 06:07:17:545, 95.3674, 24.2727, 1000001, 254517.9435, 3354, 575, 165.8564, 1739132.1739 +``` + +## Benchmark Results Table Explained: + +- **Messages Processed** – Total number of messages handled during the test. +- **Records/sec** – Rate of messages sent or consumed per second. +- **MB/sec** – Data throughput in megabytes per second. +- **Avg Latency (ms)** – Average delay in sending messages (producer only). +- **Max Latency (ms)** – Longest observed delay in sending messages (producer only). +- **50th (ms)** – Median latency (half the messages were faster, half slower). +- **95th (ms)** – Latency below which 95% of messages were delivered. +- **99th (ms)** – Latency below which 99% of messages were delivered. +- **99.9th (ms)** – Latency below which 99.9% of messages were delivered. + +## Benchmark summary on Arm64: +Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**. 
+### Consumer Performance Test +| Metric | Value | Unit | +|-----------------------------|-------------|---------------| +| Total Time Taken | 3.875 | Seconds | +| Data Consumed | 95.3674 | MB | +| Throughput (Data) | 24.6110 | MB/sec | +| Messages Consumed | 1,000,001 | Messages | +| Throughput (Messages) | 258,064.77 | Messages/sec | +| Rebalance Time | 3348 | Milliseconds | +| Fetch Time | 527 | Milliseconds | +| Fetch Throughput (Data) | 180.9629 | MB/sec | +| Fetch Throughput (Messages)| 1,897,535.10| Messages/sec | + +### Producer Performance Test +| Metric | Records Sent | Records/sec | Throughput | Average Latency | Maximum Latency | 50th Percentile Latency | 95th Percentile Latency | 99th Percentile Latency | 99.9th Percentile Latency | +|--------|--------------|-------------|------------|-----------------|-----------------|-------------------------|-------------------------|-------------------------|---------------------------| +| Value | 1,000,000 | 257,532.8 | 24.56 | 816.19 | 1237.00 | 799 | 1168 | 1220 | 1231 | +| Unit | Records | Records/sec | MB/sec | ms | ms | ms | ms | ms | ms | + +## Benchmark summary on x86_64: +Here is a summary of the benchmark results collected on x86_64 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**. +### Consumer Performance Test +| Metric | Value | Unit | +|--------------------|-------------|---------------| +| Total Time Taken | 3.811 | Seconds | +| Data Consumed | 95.3674 | MB | +| Throughput (Data) | 25.0243 | MB/sec | +| Messages Consumed | 1,000,001 | Messages | +| Throughput (Messages) | 262,398.58 | Messages/sec | +| Rebalance Time | 3271 | Milliseconds | +| Fetch Time | 540 | Milliseconds | +| Fetch Throughput (Data) | 176.6064 | MB/sec | +| Fetch Throughput (Messages) | 1,851,853.70| Messages/sec | + +### Producer Performance Test +| Metric | Records Sent | Records/sec | Throughput | Average Latency | Maximum Latency | 50th Percentile Latency | 95th Percentile Latency | 99th Percentile Latency | 99.9th Percentile Latency | +|--------|--------------|-------------|------------|-----------------|-----------------|-------------------------|-------------------------|-------------------------|---------------------------| +| Value | 1,000,000 | 242,013.6 | 23.08 | 840.69 | 1351.00 | 832 | 1283 | 1330 | 1350 | +| Unit | Records | Records/sec | MB/sec | ms | ms | ms | ms | ms | ms | + +## Benchmark comparison insights +When comparing the results on Arm64 vs x86_64 virtual machines: + + +- The Kafka **consumer** achieved **25.02 MB/sec throughput**, processing ~**262K messages/sec** with fetch throughput exceeding **1.85M messages/sec**. +- The Kafka **producer** sustained **23.08 MB/sec throughput**, with an average latency of ~**841 ms** and peak latency of ~**1351 ms**. +- These results confirm stable Kafka performance on the **Azure Ubuntu Pro arm64 virtual machine**, validating its suitability for **baseline testing and benchmarking**. + +You have now benchmarked Kafka on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64. 
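As a follow-on experiment, you can rerun the producer benchmark with a larger record size to see how message size affects throughput and latency on both architectures. This sketch reuses the same topic and flags shown above; the `--record-size 1024` value is only an example:

```console
cd /opt/kafka
bin/kafka-producer-perf-test.sh \
  --topic test-topic-kafka \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092
```

Larger records typically increase MB/sec while reducing records/sec, so compare both columns when interpreting the results.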
diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/create-instance.md new file mode 100644 index 0000000000..9571395aa2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/create-instance.md @@ -0,0 +1,50 @@ +--- +title: Create an Arm based cloud virtual machine using Microsoft Cobalt 100 CPU +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction + +There are several ways to create an Arm-based Cobalt 100 virtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). This guide will use the Azure console to create a virtual machine with Arm-based Cobalt 100 Processor. + +This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure. + +If you have never used the Microsoft Cloud Platform before, please review the microsoft [guide to Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu). + +#### Create an Arm-based Azure Virtual Machine + +Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to "Virtual Machines". +1. Select "Create", and click on "Virtual Machine" from the drop-down list. +2. Inside the "Basic" tab, fill in the Instance details such as "Virtual machine name" and "Region". +3. Choose the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select “Arm64” as the VM architecture. +4. In the “Size” field, click on “See all sizes” and select the D-Series v6 family of virtual machines. Select “D4ps_v6” from the list. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Figure 1: Select the D-Series v6 family of virtual machines") + +5. Select "SSH public key" as an Authentication type. Azure will automatically generate an SSH key pair for you and allow you to store it for future use. It is a fast, simple, and secure way to connect to your virtual machine. +6. Fill in the Administrator username for your VM. +7. Select "Generate new key pair", and select "RSA SSH Format" as the SSH Key Type. RSA could offer better security with keys longer than 3072 bits. Give a Key pair name to your SSH key. +8. In the "Inbound port rules", select HTTP (80) and SSH (22) as the inbound ports. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Figure 2: Allow inbound port rules") + +9. Click on the "Review + Create" tab and review the configuration for your virtual machine. It should look like the following: + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Figure 3: Review and Create an Azure Cobalt 100 Arm64 VM") + +10. Finally, when you are confident about your selection, click on the "Create" button, and click on the "Download Private key and Create Resources" button. 
+ +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Figure 4: Download Private key and Create Resources") + +11. Your virtual machine should be ready and running within no time. You can SSH into the virtual machine using the private key, along with the Public IP details. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "Figure 5: VM deployment confirmation in Azure portal") + +{{% notice Note %}} + +To learn more about Arm-based virtual machine in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure). + +{{% /notice %}} diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/kafka-azure/deploy.md new file mode 100644 index 0000000000..ac9a3ad15c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kafka-azure/deploy.md @@ -0,0 +1,50 @@ +--- +title: Install Kafka +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Install Kafka on Azure Cobalt 100 + +This section walks you through installing latest version of Apache Kafka on an Ubuntu Pro 24.04 Arm virtual machine. You’ll download Kafka, extract it into `/opt`, configure permissions, and verify the installation by checking the installed version. + +Follow the below instructions to install Kafka on Ubuntu Pro 24.04 virtual machine. + +### Install Java + +Kafka requires Java to run. Install it by executing the following commands: +```console +sudo apt update +sudo apt install -y default-jdk +``` +### Download and Install Kafka + +This sequence of commands downloads Kafka version 4.1.0 to the `/opt` directory, extracts the tarball, renames the folder to kafka for simplicity, and sets ownership so the current user can access and manage the Kafka installation. It prepares the system for running Kafka without permission issues. + +```console +cd /opt +sudo curl -O https://archive.apache.org/dist/kafka/4.1.0/kafka_2.13-4.1.0.tgz +sudo tar -xvzf kafka_2.13-4.1.0.tgz +sudo mv kafka_2.13-4.1.0 kafka +sudo chown -R $USER:$USER kafka +``` +{{% notice Note %}} +Kafka [3.5.0 release announcement](https://kafka.apache.org/blog#apache_kafka_350_release_announcement) includes a significant number of new features and fixes, including improving Kafka Connect and MirrorMaker 2. They aren't Arm-specific, but can benefit all architectures, including Linux/Arm64. +The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) recommends Apache Kafka version 3.5.0 as the minimum recommended on Arm platforms. +{{% /notice %}} + +### Check installed Kafka version + +These commands navigate to the Kafka installation directory and check the installed Kafka version, confirming that Kafka has been successfully installed and is ready for use. +```console +cd /opt/kafka +bin/kafka-topics.sh --version +``` + +You should see an output similar to: +```output +4.1.0 +``` +Kafka installation is complete. You can now proceed with the baseline testing. 
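Optionally, you can add the Kafka `bin` directory to your `PATH` so the command-line tools can be run from any directory. This is a convenience step and is not required; the remaining sections run `cd /opt/kafka` before each command:

```console
echo 'export PATH=$PATH:/opt/kafka/bin' >> ~/.bashrc
source ~/.bashrc
kafka-topics.sh --version
```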
diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/final-vm.png new file mode 100644 index 0000000000..5207abfb41 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/final-vm.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance.png b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance1.png new file mode 100644 index 0000000000..b9d22c352d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/kafka-azure/images/ubuntu-pro.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/_index.md b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/_index.md index 7fc65f7eb4..6f62cd94af 100644 --- a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/_index.md @@ -1,23 +1,19 @@ --- -title: Autoscaling HTTP applications on Kubernetes +title: Autoscale HTTP applications on Kubernetes with KEDA and Kedify -draft: true -cascade: - draft: true - minutes_to_complete: 45 -who_is_this_for: This is an introductory topic for developers running HTTP-based workloads on Kubernetes who want to enable event-driven autoscaling. +who_is_this_for: This is an introductory topic for developers running HTTP workloads on Kubernetes who want to enable event-driven autoscaling with KEDA and Kedify. learning_objectives: - - Install Kedify (KEDA build, HTTP Scaler, and Kedify Agent) via Helm - - Verify that the components are running in your cluster + - Install Kedify (KEDA build, HTTP Scaler, and Kedify Agent) with Helm + - Verify that Kedify and KEDA components are running in the cluster - Deploy a sample HTTP application and test autoscaling behavior prerequisites: - A running Kubernetes cluster (local or cloud) - - kubectl and helm installed locally - - Access to the Kedify Service dashboard (https://dashboard.kedify.io/) to obtain Organization ID and API Key. 
You can log in or create an account if you don’t have one + - Kubectl and Helm installed + - Access to the Kedify Service dashboard to obtain your Organization ID and API key (sign up at [Kedify dashboard](https://dashboard.kedify.io/)) author: Zbynek Roubalik diff --git a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/http-scaling.md b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/http-scaling.md index 88715908f3..ddda708890 100644 --- a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/http-scaling.md +++ b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/http-scaling.md @@ -1,31 +1,36 @@ --- -title: "HTTP Scaling for Ingress-Based Applications" +title: "Autoscale HTTP applications with Kedify and Kubernetes Ingress" weight: 4 layout: "learningpathall" --- +## Overview -In this section, you’ll gain hands-on experience with Kedify HTTP autoscaling. You will deploy a small web service, expose it through a standard Kubernetes Ingress, and rely on Kedify’s autowiring to route traffic via its proxy so requests are measured and drive scaling. +In this section, you’ll gain hands-on experience with Kedify HTTP autoscaling. You will deploy a small web service, expose it through a standard Kubernetes Ingress, and rely on Kedify’s autowiring to route traffic through its proxy so that requests are measured and drive scaling. -You will scale a real HTTP app exposed through Kubernetes Ingress using Kedify’s [kedify-http](https://docs.kedify.io/scalers/http-scaler/) scaler. You will deploy a simple application, enable autoscaling with a [ScaledObject](https://keda.sh/docs/latest/concepts/scaling-deployments/), generate load, and observe the system scale out and back in (including scale-to-zero when idle). +You will scale a real HTTP app exposed through Kubernetes Ingress using [Kedify’s HTTP Scaler](https://docs.kedify.io/scalers/http-scaler/), and then move on to deploy a simple application, enable autoscaling with a scaled object, generate load, and observe the system scale out and back in (including scale-to-zero when idle). + +For more information, see [Scaling Deployments, StatefulSets & Custom Resources](https://keda.sh/docs/latest/concepts/scaling-deployments/) on the KEDA website. ## How it works -With ingress autowiring enabled, Kedify automatically routes traffic through its proxy before it reaches your Service/Deployment: +With ingress autowiring enabled, Kedify automatically routes traffic through its proxy before it reaches your service and deployment: ```output Ingress → kedify-proxy → Service → Deployment ``` -The [Kedify Proxy](https://docs.kedify.io/scalers/http-scaler/#kedify-proxy) gathers request metrics used by the scaler to make decisions. +The [Kedify proxy](https://docs.kedify.io/scalers/http-scaler/#kedify-proxy) gathers request metrics used by the scaler to make decisions. + +## Deployment overview -## Deployment Overview - * Deployment & Service: An HTTP server with a small response delay to simulate work - * Ingress: Public entry point configured using host `application.keda` - * ScaledObject: A Kedify HTTP scaler using `trafficAutowire: ingress` +There are three main components involved in the process: +* For the application deployment and service, there is an HTTP server with a small response delay to simulate work. +* For ingress, there is a public entry point that is configured using the `application.keda` host. 
+* For the ScaledObject, there is a Kedify HTTP scaler using `trafficAutowire: ingress`. -## Step 1 — Configure the Ingress IP environment variable +## Configure the Ingress IP environment variable -Before testing the application, make sure the INGRESS_IP environment variable is set to your ingress controller’s external IP address or hostname. +Before testing the application, make sure the `INGRESS_IP` environment variable is set to your ingress controller’s external IP address or hostname. If you followed the [Install Ingress Controller](../install-ingress/) guide, you should already have this set. If not, or if you're using an existing ingress controller, run this command: @@ -39,11 +44,9 @@ This will store the correct IP or hostname in the $INGRESS_IP environment variab If your ingress controller service uses a different name or namespace, update the command accordingly. For example, some installations use `nginx-ingress-controller` or place it in a different namespace. {{% /notice %}} -## Step 2 — Deploy the application and configure Ingress - -Now you will deploy a simple HTTP server and expose it using an Ingress resource. The source code for this application is available on [GitHub](https://github.com/kedify/examples/tree/main/samples/http-server). +## Deploy the application and configure Ingress -#### Deploy the application +Now you will deploy a simple HTTP server and expose it using an Ingress resource. The source code for this application is available on the [Kedify GitHub repository](https://github.com/kedify/examples/tree/main/samples/http-server). Run the following command to deploy your application: @@ -116,25 +119,29 @@ spec: EOF ``` -Notes: -- `RESPONSE_DELAY` adds ~300ms latency per request, making scaling effects easier to see. -- The Ingress uses host `application.keda`. To access this app we will use your ingress controller’s IP with a `Host:` header (shown below). +## Key settings explained -#### Verify the application is running correctly +The manifest includes a few key options that affect scaling behavior: -You will now check if you have 1 replica of the application deployed and ready: +- `RESPONSE_DELAY` is set in the Deployment manifest above and adds approximately 300 ms latency per request; this slower response time increases the number of concurrent requests, making scaling effects easier to observe. +- The ingress uses the host `application.keda`. To access this app, use your Ingress controller’s IP with a `Host:` header. + +## Verify the application is running + +Run the following command to check that 1 replica is ready: ```bash kubectl get deployment application ``` -In the output you should see 1 replica ready: +Expected output includes 1 available replica: ```output NAME READY UP-TO-DATE AVAILABLE AGE application 1/1 1 1 3m44s ``` -#### Test the application +## Test the application + Once the application and Ingress are deployed, verify that everything is working correctly by sending a request to the exposed endpoint. Run the following command: ```bash @@ -150,13 +157,14 @@ Content-Length: 301 Connection: keep-alive ``` -## Step 3 — Enable autoscaling with Kedify +## Enable autoscaling with Kedify -The application is now running. Next, you will enable autoscaling so that it can scale dynamically between 0 and 10 replicas. Kedify ensures that no requests are dropped during scaling. Apply the `ScaledObject` by running the following command: +The application is now running. 
Next, you will enable autoscaling so that it can scale dynamically between 0 and 10 replicas. Kedify ensures that no requests are dropped during scaling. Apply the `ScaledObject` by running the following command: ```bash cat <<'EOF' | kubectl apply -f - apiVersion: keda.sh/v1alpha1 + kind: ScaledObject metadata: name: application @@ -192,23 +200,26 @@ spec: EOF ``` -Key Fields explained: -- `type: kedify-http` — Specifies that Kedify’s HTTP scaler should be used. -- `hosts`, `pathPrefixes` — Define which requests are monitored for scaling decisions. -- `service`, `port` — TIdentify the Kubernetes Service and port that will receive the traffic. -- `scalingMetric: requestRate` and `targetValue: 10` — Scale out when request rate exceeds the target threshold (e.g., 1000 req/s per window, depending on configuration granularity). -- `minReplicaCount: 0` — Enables scale-to-zero when there is no traffic. -- `trafficAutowire: ingress` — Automatically wires your Ingress to the Kedify proxy for seamless traffic management. +## Key fields explained + +Use the following field descriptions to understand how the `ScaledObject` controls HTTP-driven autoscaling and how each setting affects traffic routing and scale decisions: + +- `type: kedify-http` - Uses Kedify’s HTTP scaler. +- `hosts`, `pathPrefixes` - Define which requests are monitored for scaling decisions. +- `service`, `port` - Identify the Kubernetes Service and port that receive traffic. +- `scalingMetric: requestRate`, `granularity: 1s`, `window: 10s`, `targetValue: "10"` - Scales out when the average request rate exceeds ~10 requests/second (rps) per replica over the last 10 seconds. +- `minReplicaCount: 0` - Enables scale to zero when there is no traffic. +- `trafficAutowire: ingress` - Automatically wires your Ingress to the Kedify proxy for seamless traffic management. After applying, the `ScaledObject` will appear in the Kedify dashboard (https://dashboard.kedify.io/). -![Kedify Dashboard With ScaledObject](images/scaledobject.png) +![Kedify dashboard showing the ScaledObject alt-text#center](images/scaledobject.png "Kedify dashboard: ScaledObject") -## Step 4 — Send traffic and observe scaling +## Send traffic and observe scaling Since no traffic is currently being sent to the application, it will eventually scale down to zero replicas. -#### Verify scale to zero +## Verify scale to zero To confirm that the application has scaled down, run the following command and watch until the number of replicas reaches 0: @@ -216,8 +227,8 @@ To confirm that the application has scaled down, run the following command and w watch kubectl get deployment application -n default ``` -You should see similar output: -```bash +You should see output similar to: +```output Every 2,0s: kubectl get deployment application -n default NAME READY UP-TO-DATE AVAILABLE AGE @@ -225,17 +236,16 @@ application 0/0 0 0 110s ``` This continuously monitors the deployment status in the default namespace. Once traffic stops and the idle window has passed, you should see the application deployment report 0/0 replicas, indicating that it has successfully scaled to zero. -#### Verify the app can scale from zero +## Verify the app can scale from zero -Next, test that the application can scale back up from zero when traffic arrives. Send a request to the app: +Send a request to trigger scale-up: ```bash curl -I -H "Host: application.keda" http://$INGRESS_IP ``` -The application should scale from 0 → 1 replica automatically. 
You should receive an HTTP 200 OK response, confirming that the service is reachable again. -#### Test higher load +The application scales from 0 → 1 replica automatically, and you should receive an HTTP `200 OK` response. Now, generate a heavier, sustained load against the application. You can use `hey` (or a similar benchmarking tool): @@ -264,7 +274,7 @@ Expected behavior: You can also monitor traffic and scaling in the Kedify dashboard: -![Kedify Dashboard ScaledObject Detail](images/load.png) +![Kedify dashboard showing request load and scaling over time alt-text#center](images/load.png "Kedify dashboard: request load and scaling over time") ## Clean up @@ -280,4 +290,4 @@ This will delete the `ScaledObject`, Ingress, Service, and Deployment associated ## Next steps -To go futher, you can explore the Kedify [How-to guides](https://docs.kedify.io/how-to/) for more configurations such as Gateway API, Istio VirtualService, or OpenShift Routes. +To go further, you can explore the [Kedify How-To Guides](https://docs.kedify.io/how-to/) for more configurations such as Gateway API, Istio VirtualService, or OpenShift Routes. diff --git a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-ingress.md b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-ingress.md index 038e07e45d..fa6cd51df3 100644 --- a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-ingress.md +++ b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-ingress.md @@ -1,92 +1,73 @@ --- -title: "Install Ingress Controller" +title: "Install an ingress controller" weight: 3 layout: "learningpathall" --- -Before deploying HTTP applications with Kedify autoscaling, you need an Ingress Controller to handle incoming traffic. Most managed Kubernetes services offered by major cloud providers (AWS EKS, Google GKE, Azure AKS) do not include an Ingress Controller by default. +## Install an ingress controller for HTTP autoscaling on Kubernetes + +Before deploying HTTP applications with Kedify autoscaling, you need an ingress controller to handle incoming traffic. Most managed Kubernetes services (AWS EKS, Google GKE, Azure AKS) do not include an ingress controller by default. In this Learning Path, you install the NGINX Ingress Controller with Helm and target arm64 nodes. {{% notice Note %}} -If your cluster already has an Ingress Controller installed and configured, you can skip this step and proceed directly to the [HTTP Scaling guide](../http-scaling/). +If your cluster already has an ingress controller installed and configured, you can skip this step and proceed to the [Autoscale HTTP applications with Kedify and Kubernetes Ingress section](../http-scaling/). 
{{% /notice %}} -## Install NGINX Ingress Controller via Helm +## Install the NGINX Ingress Controller with Helm Add the NGINX Ingress Controller Helm repository: - ```bash helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx helm repo update ``` -Install the NGINX Ingress Controller: - +Install the NGINX Ingress Controller (with `nodeSelector` and `tolerations` for arm64): ```bash -helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ - --namespace ingress-nginx \ - --create-namespace \ - \ - --set "controller.nodeSelector.kubernetes\.io/arch=arm64" \ - --set "controller.tolerations[0].key=kubernetes.io/arch" \ - --set "controller.tolerations[0].operator=Equal" \ - --set "controller.tolerations[0].value=arm64" \ - --set "controller.tolerations[0].effect=NoSchedule" \ - \ - --set "controller.admissionWebhooks.patch.nodeSelector.kubernetes\.io/arch=arm64" \ - --set "controller.admissionWebhooks.patch.tolerations[0].key=kubernetes.io/arch" \ - --set "controller.admissionWebhooks.patch.tolerations[0].operator=Equal" \ - --set "controller.admissionWebhooks.patch.tolerations[0].value=arm64" \ - --set "controller.admissionWebhooks.patch.tolerations[0].effect=NoSchedule" +helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx --create-namespace --set "controller.nodeSelector.kubernetes\.io/arch=arm64" --set "controller.tolerations[0].key=kubernetes.io/arch" --set "controller.tolerations[0].operator=Equal" --set "controller.tolerations[0].value=arm64" --set "controller.tolerations[0].effect=NoSchedule" --set "controller.admissionWebhooks.patch.nodeSelector.kubernetes\.io/arch=arm64" --set "controller.admissionWebhooks.patch.tolerations[0].key=kubernetes.io/arch" --set "controller.admissionWebhooks.patch.tolerations[0].operator=Equal" --set "controller.admissionWebhooks.patch.tolerations[0].value=arm64" --set "controller.admissionWebhooks.patch.tolerations[0].effect=NoSchedule" ``` -Wait for the LoadBalancer to be ready: - +Wait for the load balancer to be ready: ```bash -kubectl wait --namespace ingress-nginx \ - --for=condition=ready pod \ - --selector=app.kubernetes.io/component=controller \ - --timeout=300s +kubectl wait --namespace ingress-nginx --for=condition=ready pod --selector=app.kubernetes.io/component=controller --timeout=300s ``` -## Get the External Endpoint +Managed clouds may take a few minutes to allocate a public IP address or hostname. -Get the external IP address or hostname for your ingress controller and save it as an environment variable: +## Get the external endpoint +Retrieve the external IP address or hostname and store it in an environment variable: ```bash -export INGRESS_IP=$(kubectl get service ingress-nginx-controller --namespace=ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}') +export INGRESS_IP=$(kubectl get service ingress-nginx-controller --namespace=ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}') echo "Ingress IP/Hostname: $INGRESS_IP" ``` -This will save the external IP or hostname in the `INGRESS_IP` environment variable and display it. If the command doesn't print any value, please repeat it after some time. 
Please note the value: -- **AWS EKS**: You'll see an AWS LoadBalancer hostname (e.g., `a1234567890abcdef-123456789.us-west-2.elb.amazonaws.com`) -- **Google GKE**: You'll see an IP address (e.g., `34.102.136.180`) -- **Azure AKS**: You'll see an IP address (e.g., `20.62.196.123`) +Typical values by provider: +- **AWS EKS**: Load balancer hostname (for example, `a1234567890abcdef-123456789.us-west-2.elb.amazonaws.com`) +- **Google GKE**: IP address (for example, `34.102.136.180`) +- **Azure AKS**: IP address (for example, `20.62.196.123`) -## Configure Access +If no value is printed, wait briefly and re-run the command. -To configure access to the ingress controller, you have two options: +## Configure access -### Option 1: DNS Setup (Recommended for production) -Point `application.keda` to your ingress controller's external IP/hostname using your DNS provider. +You have two options: -### Option 2: Host Header (Quick setup) -Use the external IP/hostname directly with a `Host:` header in your requests. When testing, you will use: +- Option 1: DNS (recommended for production): + create a DNS record pointing `application.keda` to the external IP address or hostname of your ingress controller. -```bash -curl -H "Host: application.keda" http://$INGRESS_IP -``` - -The `$INGRESS_IP` environment variable contains the actual external IP or hostname from your ingress controller service. +- Option 2: host header (quick test): + use the external IP address or hostname directly with a `Host:` header: + ```bash + curl -H "Host: application.keda" http://$INGRESS_IP + ``` + Here, `$INGRESS_IP` expands to the external IP address or hostname of the ingress controller. -## Verification - -Verify that the ingress controller is working by checking its readiness: +## Verify the installation +List the controller pods and confirm they are running: ```bash kubectl get pods --namespace ingress-nginx ``` You should see the `ingress-nginx-controller` pod in `Running` status. - -Now that you have an Ingress Controller installed and configured, proceed to the next section to deploy an application and configure Kedify autoscaling. +Now that you have an ingress controller installed and configured, proceed to the next section to deploy an application and configure Kedify autoscaling. diff --git a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-kedify-helm.md b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-kedify-helm.md index 0d65234874..158f7ba342 100644 --- a/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-kedify-helm.md +++ b/content/learning-paths/servers-and-cloud-computing/kedify-http-autoscaling/install-kedify-helm.md @@ -1,55 +1,62 @@ --- -title: "Install Kedify via Helm" +title: "Install Kedify using Helm" weight: 2 layout: "learningpathall" --- -In this section you will learn how to install Kedify on your Kubernetes cluster using Helm. You will add the Kedify chart repo, install KEDA (Kedify build), the HTTP Scaler, and the Kedify Agent, then verify everything is running. +## Overview +In this section, you will install Kedify on your Kubernetes cluster using Helm. You will add the Kedify chart repository, then install three separate Helm charts: KEDA (Kedify build) for event-driven autoscaling, the HTTP Scaler for HTTP-based scaling, and the Kedify Agent for connecting your cluster to Kedify's cloud service. You will then verify the installation. 
This enables HTTP autoscaling on Kubernetes with KEDA and Kedify, including arm64 nodes. -For more details and all installation methods on Arm, you can refer to the [Kedify installation docs](https://docs.kedify.io/installation/helm#installation-on-arm) +For more information and other installation methods on Arm, see the [Kedify installation documentation](https://docs.kedify.io/installation/helm#installation-on-arm). ## Before you begin You will need: -- A running Kubernetes cluster (e.g., kind, minikube, EKS, GKE, AKS, etc.), hosted on any cloud provider or local environment. -- kubectl and helm installed and configured to communicate with your cluster -- A Kedify Service account (https://dashboard.kedify.io/) to obtain Organization ID and API Key — log in or create an account if you don’t have one +- A running Kubernetes cluster (for example, kind, minikube, EKS, GKE, or AKS), hosted on any cloud provider or local environment +- Kubectl and Helm installed and configured to communicate with your cluster +- A Kedify Service account to obtain your Organization ID and API key (sign up at the [Kedify dashboard](https://dashboard.kedify.io/)) -## Installation +## Gather Kedify credentials -1) Get your Organization ID: In the Kedify dashboard (https://dashboard.kedify.io/) go to Organization -> Details and copy the ID. - -2) Get your API key: -- If you already have a Kedify Agent deployed, you can retrieve it from the existing Secret: +From the Kedify dashboard, copy your Organization ID (**Organization** → **Details**) and retrieve or create an API key. +If you already have a Kedify Agent deployed, decode the key from the existing Secret: ```bash kubectl get secret -n keda kedify-agent -o=jsonpath='{.data.apikey}' | base64 --decode ``` +Otherwise, in the Kedify dashboard go to **Organization** → **API Keys**, select **Create Agent Key**, and copy the key. -Otherwise, in the Kedify dashboard (https://dashboard.kedify.io/) go to Organization -> API Keys, click Create Agent Key, and copy the key. +{{% notice Note %}} +The API key is shared across all agent installations. If you regenerate it, update existing agents and keep it secret. +{{% /notice %}} -Note: The API Key is shared across all your Agent installations. If you regenerate it, update existing Agent installs and keep it secret. +Optionally, export these values for reuse in the following commands: +```bash +export YOUR_ORG_ID="" +export YOUR_API_KEY="" +export CLUSTER_NAME="my-arm-cluster" +``` -## Helm repository +## Add the Kedify Helm repository Add the Kedify Helm repository and update your local index: - ```bash helm repo add kedifykeda https://kedify.github.io/charts helm repo update ``` -## Helm installation +## Install components with Helm -Most providers like AWS EKS and Azure AKS automatically place pods on Arm nodes when you specify `nodeSelector` for `kubernetes.io/arch=arm64`. However, Google Kubernetes Engine (GKE) applies an explicit taint on Arm nodes, requiring matching `tolerations`. +Most providers (such as EKS and AKS) schedule pods on Arm nodes when you specify a `nodeSelector` for `kubernetes.io/arch=arm64`. On Google Kubernetes Engine (GKE), Arm nodes commonly have an explicit taint, so matching `tolerations` are required. To stay portable across providers, configure both `nodeSelector` and `tolerations`. -To ensure a portable deployment strategy across all cloud providers, it is recommended that you configure both `nodeSelector` and `tolerations` in your Helm values or CLI flags. 
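+If you prefer to keep these scheduling settings in a file instead of repeating `--set` flags, the sketch below captures them once so you can reuse the file for each chart. It assumes the chart accepts top-level `nodeSelector` and `tolerations` values, which matches the CLI flags used for the KEDA chart in this Learning Path; check each chart's values reference before relying on it, and note that the file name `arm64-values.yaml` is only an example.
+
+```bash
+# Sketch: shared arm64 scheduling values (verify the key paths against each chart's values reference)
+cat <<'EOF' > arm64-values.yaml
+nodeSelector:
+  kubernetes.io/arch: arm64
+tolerations:
+  - key: kubernetes.io/arch
+    operator: Equal
+    value: arm64
+    effect: NoSchedule
+EOF
+
+# Example usage with the KEDA chart; add any chart-specific --set flags as needed:
+# helm upgrade --install keda kedifykeda/keda --namespace keda --create-namespace -f arm64-values.yaml
+```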
+{{% notice Note %}} +For a portable deployment across cloud providers, configure both `nodeSelector` and `tolerations` in your Helm values or CLI flags. +{{% /notice %}} -Install each component into the keda namespace. Replace placeholders where noted. - -1) Install Kedify build of KEDA: +## Install the Kedify build of KEDA +Run the following Helm command to install the Kedify build of KEDA into the `keda` namespace: ```bash helm upgrade --install keda kedifykeda/keda \ --namespace keda \ @@ -62,8 +69,9 @@ helm upgrade --install keda kedifykeda/keda \ --set "tolerations[0].effect=NoSchedule" ``` -2) Install Kedify HTTP Scaler: +## Install the Kedify HTTP Scaler +Install the Kedify HTTP Scaler with matching node selector and tolerations: ```bash helm upgrade --install keda-add-ons-http kedifykeda/keda-add-ons-http \ --namespace keda \ @@ -80,8 +88,9 @@ helm upgrade --install keda-add-ons-http kedifykeda/keda-add-ons-http \ --set "scaler.tolerations[0].effect=NoSchedule" ``` -3) Install Kedify Agent (edit clusterName, orgId, apiKey): +## Install the Kedify Agent +Edit the cluster name, Organization ID, and API key (or rely on the exported environment variables), then run: ```bash helm upgrade --install kedify-agent kedifykeda/kedify-agent \ --namespace keda \ @@ -95,22 +104,19 @@ helm upgrade --install kedify-agent kedifykeda/kedify-agent \ --set "agent.kedifyProxy.globalValues.tolerations[0].operator=Equal" \ --set "agent.kedifyProxy.globalValues.tolerations[0].value=arm64" \ --set "agent.kedifyProxy.globalValues.tolerations[0].effect=NoSchedule" \ - \ - --set clusterName="my-arm-cluster" \ - --set agent.orgId="$YOUR_ORG_ID" \ - --set agent.apiKey="$YOUR_API_KEY" + --set clusterName="${CLUSTER_NAME:-my-arm-cluster}" \ + --set agent.orgId="${YOUR_ORG_ID}" \ + --set agent.apiKey="${YOUR_API_KEY}" ``` ## Verify installation -You are now ready to verify your installation: - +List pods in the `keda` namespace to confirm all components are running: ```bash kubectl get pods -n keda ``` -Expected output should look like (names may differ): - +Expected output (names might vary): ```output NAME READY STATUS RESTARTS AGE keda-add-ons-http-external-scaler-xxxxx 1/1 Running 0 1m @@ -121,4 +127,4 @@ keda-operator-metrics-apiserver-xxxxx 1/1 Running 0 1m kedify-agent-xxxxx 1/1 Running 0 1m ``` -Proceed to the next section to learn how to install an Ingress controller before deploying a sample HTTP app and testing autoscaling. +Proceed to the next section to install an ingress controller, deploy a sample HTTP app, and test autoscaling. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md index 9477e2b1c3..ffcbe3a19b 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/1_overview.md @@ -14,7 +14,7 @@ Frameworks such as [**llama.cpp**](https://github.com/ggml-org/llama.cpp), provi To analyze their execution and use profiling insights for optimization, you need both a basic understanding of transformer architectures and the right analysis tools. -This Learning Path demonstrates how to use `llama-cli` application from llama.cpp together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. 
+This Learning Path demonstrates how to use `llama-cli` from the command line together with Arm Streamline to analyze the efficiency of LLM inference on Arm CPUs. You will learn how to: - Profile token generation at the Prefill and Decode stages @@ -23,4 +23,4 @@ You will learn how to: You will run the `Qwen1_5-0_5b-chat-q4_0.gguf` model using `llama-cli` on Arm Linux and use Streamline for analysis. -The same method can also be applied to Android platforms. +The same method can also be used on Android. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md index 70510e4cea..2c1e4129f0 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro.md @@ -83,4 +83,4 @@ At the Decode stage, by utilizing the [KV cache](https://huggingface.co/blog/not In summary, Prefill is compute-bound, dominated by large GEMM operations and Decode is memory-bound, dominated by KV cache access and GEMV operations. -You will see this highlighted during the analysis with Streamline. \ No newline at end of file +You will see this highlighted during the Streamline performance analysis. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md index 9de51513f7..cdb90f1223 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/3_llama.cpp_annotation.md @@ -20,26 +20,23 @@ You can either build natively on an Arm platform, or cross-compile on another ar ### Step 1: Build Streamline Annotation library -Install [Arm DS](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio) or [Arm Streamline](https://developer.arm.com/Tools%20and%20Software/Streamline%20Performance%20Analyzer) on your development machine first. +Download and install [Arm Performance Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads) on your development machine. -Streamline Annotation support code is in the installation directory such as `Arm/Development Studio 2024.1/sw/streamline/gator/annotate`. - -For installation guidance, refer to the [Streamline installation guide](/install-guides/streamline/). - -Clone the gator repository that matches your Streamline version and build the `Annotation support library`. +{{% notice Note %}} +You can also download and install [Arm Development Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio#Downloads), as it also includes Streamline. -The installation step depends on your development machine. +{{% /notice %}} -For Arm native build, you can use the following instructions to install the packages. +Streamline Annotation support code is in the Arm Performance Studio installation directory in the `streamline/gator/annotate` directory. -For other machines, you need to set up the cross compiler environment by installing [Arm GNU toolchain](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). +Clone the gator repository that matches your Streamline version and build the `Annotation support library`. 
You can build it on your current machine using the native build instructions, or you can cross-compile it for another Arm computer using the cross-compile instructions.

-You can refer to the [GCC install guide](https://learn.arm.com/install-guides/gcc/cross/) for cross-compiler installation.
+If you need to set up a cross compiler, you can review the [GCC install guide](/install-guides/gcc/cross/).

 {{< tabpane code=true >}}
   {{< tab header="Arm Native Build" language="bash">}}
-    apt-get update
-    apt-get install  ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+    sudo apt-get update
+    sudo apt-get install -y ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
     cd ~
     git clone https://github.com/ARM-software/gator.git
     cd gator
@@ -47,9 +44,9 @@ You can refer to the [GCC install guide](https://learn.arm.com/install-guides/gc
     cd annotate
     make
  {{< /tab >}}
-  {{< tab header="Cross Compiler" language="bash">}}
-    apt-get update
-    apt-get install  ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
+  {{< tab header="Cross Compile" language="bash">}}
+    sudo apt-get update
+    sudo apt-get install ninja-build cmake gcc g++ g++-aarch64-linux-gnu curl zip unzip tar pkg-config git
     cd ~
     git clone https://github.com/ARM-software/gator.git
     cd gator
@@ -79,7 +76,7 @@ mkdir streamline_annotation
cp ~/gator/annotate/libstreamline_annotate.a  ~/gator/annotate/streamline_annotate.h streamline_annotation
 ```

-To link the `libstreamline_annotate.a` library when building llama-cli, add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`.
+To link the `libstreamline_annotate.a` library when building llama-cli, use an editor to add the following lines at the end of `llama.cpp/tools/main/CMakeLists.txt`.

 ```makefile
 set(STREAMLINE_LIB_PATH "${CMAKE_SOURCE_DIR}/streamline_annotation/libstreamline_annotate.a")
@@ -87,13 +84,15 @@ target_include_directories(llama-cli PRIVATE "${CMAKE_SOURCE_DIR}/streamline_ann
 target_link_libraries(llama-cli PRIVATE "${STREAMLINE_LIB_PATH}")
 ```

-To add Annotation Markers to `llama-cli`, change the `llama-cli` code in `llama.cpp/tools/main/main.cpp` by adding the include file:
+To add Annotation Markers to `llama-cli`, edit the file `llama.cpp/tools/main/main.cpp` and make three modifications.
+
+First, add the include file at the top of `main.cpp` with the other include files:

 ```c
 #include "streamline_annotate.h"
 ```

-After the call to `common_init()`, add the setup macro:
+Next, find the `common_init()` call in the `main()` function and add the Streamline setup macro below it so that the code looks like this:

 ```c
     common_init();
@@ -101,7 +100,7 @@ After the call to `common_init()`, add the setup macro:
     ANNOTATE_SETUP;
 ```

-Finally, add an annotation marker inside the main loop:
+Finally, add an annotation marker inside the main loop. Replace the annotation comments with the complete code so that it looks like this:

 ```c
     for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
@@ -150,8 +149,8 @@ Next, configure the project.
     -DBUILD_SHARED_LIBS=OFF \
     -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
     -DGGML_OPENMP=OFF  \
-    -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
-    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm -g" \
+    -DCMAKE_C_FLAGS="-march=native -g" \
+    -DCMAKE_CXX_FLAGS="-march=native -g" \
     -DGGML_CPU_KLEIDIAI=ON \
     -DLLAMA_BUILD_TESTS=OFF \
     -DLLAMA_BUILD_EXAMPLES=ON \
@@ -161,8 +160,8 @@ Next, configure the project.
 cmake ..
\ -DCMAKE_SYSTEM_NAME=Linux \ -DCMAKE_SYSTEM_PROCESSOR=arm \ - -DCMAKE_C_COMPILER=aarch64-none-linux-gnu-gcc \ - -DCMAKE_CXX_COMPILER=aarch64-none-linux-gnu-g++ \ + -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \ + -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \ -DLLAMA_NATIVE=OFF \ -DLLAMA_F16C=OFF \ -DLLAMA_GEMM_ARM=ON \ @@ -190,7 +189,7 @@ Now you can build the project using `cmake`: ```bash cd ~/llama.cpp/build -cmake --build ./ --config Release +cmake --build ./ --config Release -j $(nproc) ``` After the building process completes, you can find the `llama-cli` in the `~/llama.cpp/build/bin/` directory. diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md index c413ac15b6..00472c5863 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/4_analyze_token_prefill_decode.md @@ -8,24 +8,25 @@ layout: learningpathall ## Run llama-cli and analyze the data with Streamline -After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. +After successfully building llama-cli, the next step is to set up the runtime environment on your Arm platform. This can be your development machine or another Arm system. -### Set up gatord +### Set up the gator daemon -The gator daemon (gatord) is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data. +The gator daemon, `gatord`, is the Streamline collection agent that runs on the target device. It captures performance data including CPU metrics, PMU events, and annotations, then sends this data to the Streamline analysis tool running on your host machine. The daemon needs to be running on your target device before you can capture performance data. Depending on how you built llama.cpp: For the cross-compiled build flow: - Copy the `llama-cli` executable to your Arm target. - - Also copy the `gatord` binary from the Arm DS or Streamline installation: - - Linux: `Arm\Development Studio 2024.1\sw\streamline\bin\linux\arm64` - - Android: `Arm\Development Studio 2024.1\sw\streamline\bin\android\arm64` + - Copy the `gatord` binary from the Arm Performance Studio release. If you are targeting Linux, take it from `streamline\bin\linux\arm64` and if you are targeting Android take it from `streamline\bin\android\arm64`. + +Put both of these programs in your home directory on the target system. For the native build flow: + - Use the `llama-cli` from your local build in `llama.cpp/build/bin` and the `gatord` you compiled earlier at `~/gator/build-native-gcc-rel/gatord`. - - Use the `llama-cli` from your local build and the `gatord` you compiled earlier (`~/gator/build-native-gcc-rel/gatord`). +You now have the `gatord` and the `llama-cli` on the computer you want to run and profile. ### Download a lightweight model @@ -49,8 +50,9 @@ Start the gator daemon on your Arm target: You should see similar messages to those shown below: ``` bash -Streamline Data Recorder v9.4.0 (Build 9b1e8f8) -Copyright (c) 2010-2024 Arm Limited. All rights reserved. 
+Streamline Data Recorder v9.6.0 (Build oss) +Copyright (c) 2010-2025 Arm Limited. All rights reserved. + Gator ready ``` diff --git a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md index 11ebed8219..1f97f58d6e 100644 --- a/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/_index.md @@ -19,8 +19,8 @@ learning_objectives: prerequisites: - Basic understanding of llama.cpp - Understanding of transformer models - - Knowledge of Streamline usage - - An Arm Neoverse or Cortex-A hardware platform running Linux or Android to test the application + - Knowledge of Arm Streamline usage + - An Arm Neoverse or Cortex-A hardware platform running Linux or Android author: - Zenon Zhilong Xiu diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md index f609621253..236fdb3299 100644 --- a/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/_index.md @@ -5,14 +5,14 @@ draft: true cascade: draft: true -minutes_to_complete: 40 +minutes_to_complete: 30 -who_is_this_for: This is an advanced topic that introduces MySQL deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating MySQL applications from x86_64 to Arm. +who_is_this_for: This is an introductory topic that introduces MySQL deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating MySQL applications from x86_64 to Arm. learning_objectives: - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image. - Deploy MySQL on the Ubuntu virtual machine. - - Perform MySQL baseline testing and benchmarking on both x86_64 and Arm64 virtual machines. + - Perform MySQL baseline testing and benchmarking on Arm64 virtual machines. prerequisites: - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6) @@ -21,7 +21,7 @@ prerequisites: author: Pareena Verma ### Tags -skilllevels: Advanced +skilllevels: Introductory subjects: Databases cloud_service_providers: Microsoft Azure diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md index 401a3cd315..7d160aea17 100644 --- a/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/baseline.md @@ -6,13 +6,11 @@ weight: 6 layout: learningpathall --- -## Run a functional test of MySQL on Azure Cobalt 100 - -After installing MySQL on your Arm64 virtual machine, you can perform simple baseline testing to validate that MySQL runs correctly and produces the expected output. +After installing MySQL on your Azure Cobalt 100 Arm64 virtual machine, you should run a functional test to confirm that the database is operational and ready for use. Beyond just checking service status, validation ensures MySQL is processing queries correctly, users can authenticate, and the environment is correctly configured for cloud workloads. 
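+Before stepping through the full test, you can optionally run a quick connectivity check from the OS shell. This is a minimal sketch that assumes the `admin` user created in the previous section; `mysqladmin ping` prints `mysqld is alive` when the server is accepting connections.
+
+```console
+# Optional quick check that the server answers on the MySQL protocol
+mysqladmin -u admin -p ping
+```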
### Start MySQL
 
-Make sure MySQL is running:
+Ensure MySQL is running and configured to start on boot:
 
 ```console
 sudo systemctl start mysql
@@ -20,12 +18,17 @@ sudo systemctl enable mysql
 ```
 
 ### Connect to MySQL
+Connect using the MySQL client:
+
 ```console
 mysql -u admin -p
 ```
-Opens the MySQL client and connects as the new user(admin), prompting you to enter the admin password.
+This opens the MySQL client and connects as the new user (`admin`), prompting you to enter the admin password.
+
+### Show and Use Database
 
-### Show and use Database
+Once you’ve connected successfully with your new user, the next step is to create and interact with a database. This verifies that your MySQL instance is not only accessible but also capable of storing and organizing data.
+Run the following commands inside the MySQL shell:
 
 ```sql
 CREATE DATABASE baseline_test;
@@ -69,10 +72,13 @@ mysql> SELECT DATABASE();
 +---------------+
 1 row in set (0.00 sec)
 ```
-You created a new database named **baseline_test**, verified its presence with `SHOW DATABASES`, and confirmed it is the active database using `SELECT DATABASE()`.
+You created a new database named `baseline_test`, verified its presence with `SHOW DATABASES`, and confirmed it is the active database using `SELECT DATABASE()`.
 
 ### Create and show Table
 
+After creating and selecting a database, the next step is to define a table, which represents how your data will be structured. In MySQL, tables are the core storage objects where data is inserted, queried, and updated.
+Run the following inside the `baseline_test` database:
+
 ```sql
 CREATE TABLE test_table (
     id INT AUTO_INCREMENT PRIMARY KEY,
@@ -100,10 +106,13 @@ mysql> SHOW TABLES;
 +-------------------------+
 1 row in set (0.00 sec)
 ```
-You successfully created the table **test_table** in the `baseline_test` database and verified its existence using `SHOW TABLES`.
+You successfully created the table `test_table` in the `baseline_test` database and verified its existence using `SHOW TABLES`.
 
 ### Insert Sample Data
 
+Once the table is created, you can populate it with sample rows. This validates that MySQL can handle write operations and that the underlying storage engine is working properly.
+
+Run the following SQL command inside the `baseline_test` database:
 
 ```sql
 INSERT INTO test_table (name, value)
 VALUES
@@ -114,7 +123,7 @@ VALUES
 - `INSERT INTO test_table (name, value)` - Specifies which table and columns to insert into.
 - `VALUES` - Provides three rows of data.
 
-After inserting, you can check the data with:
+After inserting data into `test_table`, you can confirm the write operation succeeded by retrieving the rows with:
 
 ```sql
 SELECT * FROM test_table;
@@ -135,6 +144,6 @@ mysql> SELECT * FROM test_table;
 +----+---------+-------+
 3 rows in set (0.00 sec)
 ```
+This confirms that the rows were inserted successfully, that the auto-increment primary key (`id`) is working correctly, and that the query engine can read the data back and return the expected results.
 
-The functional test was successful — the **test_table** contains three rows (**Alice, Bob, and Charlie**) with their respective values, confirming MySQL is working
-correctly.
+The functional test was successful. The `test_table` contains the expected three rows (Alice, Bob, and Charlie) with their respective values. This confirms that MySQL is working correctly on your Cobalt 100 Arm-based VM, completing the installation and validation phase.
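+Optionally, if you no longer need the sample data, you can drop the test database before moving on to benchmarking. This is a sketch that assumes nothing else uses `baseline_test`.
+
+```console
+# Optional cleanup: remove the baseline test database created above
+mysql -u admin -p -e "DROP DATABASE baseline_test;"
+```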
diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md index eb9cc1b80c..c9d5c5c949 100644 --- a/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/benchmarking.md @@ -6,29 +6,34 @@ weight: 7 layout: learningpathall --- -## Benchmark MySQL on Azure Cobalt 100 Arm-based instances and x86_64 instances +## Benchmark MySQL on Azure Cobalt 100 Arm-based instances -`mysqlslap` is the official MySQL benchmarking tool used to simulate multiple client connections and measure query performance. It helps evaluate **read/write throughput, query response times**, and overall MySQL server performance under different workloads, making it ideal for baseline testing and optimization. +To understand how MySQL performs on Azure Cobalt 100 (Arm64) VMs, you can use the built-in `mysqlslap` tool. + +`mysqlslap` is the official MySQL benchmarking tool used to simulate multiple client connections and measure query performance. It helps evaluate read/write throughput, query response times, and overall MySQL server performance under different workloads, making it ideal for baseline testing and optimization. ## Steps for MySQL Benchmarking with mysqlslap 1. Connect to MySQL and Create a Database -To access the MySQL server, use the following command based on your `admin` user password: +Before running `mysqlslap`, you will create a dedicated test database so that benchmarking doesn’t interfere with your application data. This ensures clean test results and avoids accidental modifications to production schemas. +Connect to MySQL using the admin user: ```console mysql -u admin -p ``` -Once logged in, you can create a benchmark_db using SQL commands like: +Once logged in, create a benchmarking database: ```sql CREATE DATABASE benchmark_db; USE benchmark_db; ``` -3. Create a Table and Populate Data +2. Create a Table and Populate Data + +With a dedicated `benchmark_db` created, the next step is to define a test table and populate it with data. This simulates a realistic workload so that `mysqlslap` can measure query performance against non-trivial datasets. -After logging into MySQL, you can create a table to store benchmark data. Here’s a simple example: +Create a benchmark table: ```sql CREATE TABLE benchmark_table ( @@ -37,14 +42,18 @@ CREATE TABLE benchmark_table ( score INT ); ``` -Insert some sample rows manually: +Insert Sample Rows Manually: +For quick validation: ```sql INSERT INTO benchmark_table (username,score) VALUES ('John',100),('Jane',200),('Mike',300); ``` +This verifies that inserts work correctly and allows you to test small queries. -Or populate automatically with 1000 rows: +Populate Automatically with 1000 Rows + +For benchmarking, larger datasets give more meaningful results. You can use a stored procedure to generate rows programmatically: ```sql DELIMITER // @@ -62,15 +71,14 @@ DELIMITER ; CALL populate_benchmark_data(); DROP PROCEDURE populate_benchmark_data; ``` -- The table `benchmark_table` has three columns: `record_id` (primary key), `username`, and `score`. -- You can insert a few rows manually for testing or use a procedure to generate **1000 rows automatically** for more realistic benchmarking +At this stage, you have a populated `benchmark_table` inside `benchmark_db`. 
This provides a realistic dataset for running `mysqlslap`, enabling you to measure how MySQL performs on Azure Cobalt 100 under read-heavy, write-heavy, or mixed workloads.
 
 ## Run a Simple Read/Write Benchmark
 
-Once your table is ready, you can use `mysqlslap` to simulate multiple clients performing queries. This helps test MySQL’s performance under load.
+With the `benchmark_table` populated, you can run a synthetic workload using `mysqlslap` to simulate multiple clients performing inserts or queries at the same time. This tests how well MySQL handles concurrent connections and query execution.
 
 ```console
-mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 --concurrency=10 --iterations=5 --query="INSERT INTO benchmark_db.benchmark_table (username,score) VALUES('TestUser',123);" --create-schema=benchmark_db
+mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 --concurrency=10 --iterations=5 --query="INSERT INTO benchmark_db.benchmark_table (username,score) VALUES('TestUser',123);" --create-schema=benchmark_db
 ```
 - **--user / --password:** MySQL login credentials.
 - **--host:** MySQL server address (127.0.0.1 for local).
@@ -79,7 +87,7 @@ mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 -
 - **--query:** The SQL statement to run repeatedly.
 - **--create-schema:** The database in which to run the query.
 
-You should see output similar to the following:
+You should see output similar to:
 
 ```output
 Benchmark
@@ -90,13 +98,15 @@ Benchmark
         Average number of queries per client: 1
 ```
 
-Below command runs a **read benchmark** on your MySQL database using `mysqlslap`. It simulates multiple clients querying the table at the same time and records the results.
+Run a read benchmark (table scan):
+
+You can now run a test that simulates multiple clients querying the table at the same time and records the results:
 
 ```console
 mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 --concurrency=10 --iterations=5 --query="SELECT * FROM benchmark_db.benchmark_table WHERE record_id < 500;" --create-schema=benchmark_db --verbose | tee -a /tmp/mysqlslap_benchmark.log
 ```
 
-You should see output similar to the following:
+You should see output similar to:
 
 ```output
 Benchmark
@@ -109,11 +119,11 @@ Benchmark
 
 ## Benchmark Results Table Explained:
 
-- **Average number of seconds to run all queries:** This is the average time it took for all the queries in one iteration to complete across all clients. It gives you a quick sense of overall performance.
-- **Minimum number of seconds to run all queries:** This is the fastest time any iteration of queries took.
-- **Maximum number of seconds to run all queries:** This is the slowest time any iteration of queries took. The closer this is to the average, the more consistent your performance is.
-- **Number of clients running queries:** Indicates how many simulated users (or connections) ran queries simultaneously during the test.
-- **Average number of queries per client:** Shows the average number of queries each client executed in the benchmark iteration.
+- Average number of seconds to run all queries: This is the average time it took for all the queries in one iteration to complete across all clients. It gives you a quick sense of overall performance.
+- Minimum number of seconds to run all queries: This is the fastest time any iteration of queries took.
+- Maximum number of seconds to run all queries: This is the slowest time any iteration of queries took. The closer this is to the average, the more consistent your performance is.
+- Number of clients running queries: Indicates how many simulated users (or connections) ran queries simultaneously during the test.
+- Average number of queries per client: Shows the average number of queries each client executed in the benchmark iteration.
 
 ## Benchmark summary on Arm64:
 Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.
@@ -123,21 +133,13 @@ Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pr
 | INSERT     | 0.267           | 0.265           | 0.271           | 10     | 1                    |
 | SELECT     | 0.263           | 0.261           | 0.264           | 10     | 1                    |
 
-## Benchmark summary on x86_64:
-Here is a summary of the benchmark results collected on x86_64 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**.
-
-| Query Type | Average Time (s) | Minimum Time (s) | Maximum Time (s) | Clients | Avg Queries per Client |
-|------------|-----------------|-----------------|-----------------|--------|----------------------|
-| INSERT     | 0.243           | 0.231           | 0.273           | 10     | 1                    |
-| SELECT     | 0.222           | 0.214           | 0.233           | 10     | 1                    |
-
 
 ## Insights from Benchmark Results
 
 The benchmark results on the Arm64 virtual machine show:
 
-- **Balanced Performance for Read and Write Queries:** Both `INSERT` and `SELECT` queries performed consistently, with average times of **0.267s** and **0.263s**, respectively.
-- **Low Variability Across Iterations:** The difference between the minimum and maximum times was very small for both query types, indicating stable and predictable behavior under load.
-- **Moderate Workload Handling:** With **10 clients** and an average of **1 query per client**, the system handled concurrent operations efficiently without significant delays.
-- **Key Takeaway:** The MySQL setup on Arm64 provides reliable and steady performance for both data insertion and retrieval tasks, making it a solid choice for applications requiring dependable database operations.
-
-You have now benchmarked MySql on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.
+- Balanced performance for read and write queries: Both `INSERT` and `SELECT` queries performed consistently, with average times of 0.267s and 0.263s, respectively.
+- Low variability across iterations: The difference between the minimum and maximum times was very small for both query types, indicating stable and predictable behavior under load.
+- Moderate workload handling: With 10 clients and an average of 1 query per client, the system handled concurrent operations efficiently without significant delays.
+
+This demonstrates that the MySQL setup on Arm64 provides reliable and steady performance for both data insertion and retrieval tasks, making it a solid choice for applications requiring dependable database operations.
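+If you want to push the test further, `mysqlslap` can also generate its own schema and queries. The example below is a sketch of a heavier mixed read/write run; the concurrency, iteration, and query counts are illustrative values that you can adjust for your VM size.
+
+```console
+# Heavier mixed read/write workload using auto-generated SQL (illustrative values)
+mysqlslap --user=admin --password="MyStrongPassword!" --host=127.0.0.1 \
+  --auto-generate-sql --auto-generate-sql-load-type=mixed \
+  --concurrency=50 --iterations=5 --number-of-queries=5000
+```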
diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md index 9571395aa2..83bee32945 100644 --- a/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/create-instance.md @@ -1,18 +1,24 @@ --- -title: Create an Arm based cloud virtual machine using Microsoft Cobalt 100 CPU +title: Create an Azure Cobalt 100 Arm64 virtual machine weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Introduction +## Prerequisites and setup -There are several ways to create an Arm-based Cobalt 100 virtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). This guide will use the Azure console to create a virtual machine with Arm-based Cobalt 100 Processor. +There are several common ways to create an Arm-based Cobalt 100 virtual machine, and you can choose the method that best fits your workflow or requirements: -This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure. +- The Azure Portal +- The Azure CLI +- An infrastructure as code (IaC) tool -If you have never used the Microsoft Cloud Platform before, please review the microsoft [guide to Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu). +In this section, you will launch the Azure Portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor. + +This Learning Path focuses on general-purpose virtual machines in the Dpsv6 series. For more information, see the [Microsoft Azure guide for the Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). + +While the steps to create this instance are included here for convenience, you can also refer to the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/). #### Create an Arm-based Azure Virtual Machine diff --git a/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md index 7e30f55f3e..82236065c5 100644 --- a/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md +++ b/content/learning-paths/servers-and-cloud-computing/mysql-azure/deploy.md @@ -8,14 +8,14 @@ layout: learningpathall ## Install MySQL on Azure Cobalt 100 -This section walks you through installing and securing MySQL on an Azure Arm64 virtual machine. You will set up the database, configure access, and verify it’s running—ready for development and testing. +This section demonstrates how to install and secure MySQL on an Azure Arm64 virtual machine. You will configure the database, set up security measures, and verify that the service is running properly, making the environment ready for development, testing, or production deployment. -Start by installing MySQL and other essential tools: +## Install MySQL and Tools -## Install MySQL and tools +Before installing MySQL, it’s important to ensure your VM is updated so you have the latest Arm64-optimized libraries and security patches. 
Ubuntu and other modern Linux distributions maintain Arm-native MySQL packages, so installation is straightforward with the system package manager.
 
 1. Update the system and install MySQL
 
-You update your system's package lists to ensure you get the latest versions and then install the MySQL server using the package manager.
+Update your system's package lists to ensure you get the latest versions and then install the MySQL server using the package manager.
 
 ```console
 sudo apt update
@@ -24,12 +24,13 @@ sudo apt install -y mysql-server
 
 2. Secure MySQL installation
 
-After installing MySQL, You are locking down your database so only you can access it safely. It’s like setting up a password and cleaning up unused accounts to make sure no one else can mess with your data.
+Once MySQL is installed, the default configuration is functional but not secure.
+You will lock down your database so only you can access it safely. This involves setting up a password and cleaning up unused accounts to make sure no one else can access your data.
 
 ```console
 sudo mysql_secure_installation
 ```
-Follow the prompts:
+This interactive script walks you through several critical security steps. Follow the prompts:
 
 - Set a strong password for root.
 - Remove anonymous users.
@@ -37,40 +38,56 @@ Follow the prompts:
 - Remove test databases.
 - Reload privilege tables.
 
+After securing your MySQL installation, the database is significantly harder to compromise.
+
 3. Start and enable MySQL service
 
-You are turning on the database so it starts working and making sure it stays on every time you turn on your computer.:
+After installing and securing MySQL, the next step is to ensure that the MySQL server process (`mysqld`) is running and configured to start automatically whenever your VM boots.
 
 ```console
 sudo systemctl start mysql
 sudo systemctl enable mysql
 ```
-Check the status:
+Verify the MySQL status:
 
 ```console
 sudo systemctl status mysql
 ```
-You should see `active (running)`.
+You should see output similar to:
+
+```output
+mysql.service - MySQL Community Server
+     Loaded: loaded (/usr/lib/systemd/system/mysql.service; enabled; preset: enabled)
+     Active: active (running) since Tue 2025-09-30 20:31:48 UTC; 1min 53s ago
+   Main PID: 3255 (mysqld)
+     Status: "Server is operational"
+      Tasks: 39 (limit: 19099)
+     Memory: 366.4M (peak: 380.2M)
+        CPU: 952ms
+     CGroup: /system.slice/mysql.service
+             └─3255 /usr/sbin/mysqld
+```
+You should see `active (running)` in the output, which indicates that MySQL is up and running.
 
 4. Verify MySQL version
 
-You check the installed version of MySQL to confirm it’s set up correctly and is running.
+You can also check the installed version of MySQL to confirm it’s set up correctly and is running.
 
 ```console
 mysql -V
 ```
 
-You should see output similar to the following:
+You should see output similar to:
 
 ```output
 mysql  Ver 8.0.43-0ubuntu0.24.04.1 for Linux on aarch64 ((Ubuntu))
 ```
 
 5. Access MySQL shell
 
-You log in to the MySQL interface using the root user to interact with the database and perform administrative tasks:
+After confirming that MySQL is running, the next step is to log in to the MySQL monitor (shell). This is the command-line interface used to interact with the database server for administrative tasks such as creating users, managing databases, and tuning configurations.
 
 ```
 sudo mysql
 ```
 
-You should see output similar to the following:
+You should see output similar to:
 
 ```output
 Welcome to the MySQL monitor.  Commands end with ; or \g.
@@ -87,38 +104,41 @@ Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 mysql>
 ```
+The `mysql>` prompt indicates you are now in the MySQL interactive shell and can issue SQL commands.
 
 6. Create a new user
 
-You are setting up a new area to store your data and giving someone special permissions to use it. This helps you organize your work better and control who can access it:
+While the root account gives you full control, it’s best practice to avoid using it for day-to-day database operations. Instead, you should create separate users with specific privileges.
+Start by entering the MySQL shell:
 
 ```console
 sudo mysql
 ```
 
-Inside the MySQL shell, run:
+Inside the shell, create a new user:
 
 ```sql
 CREATE USER 'admin'@'localhost' IDENTIFIED BY 'MyStrongPassword!';
 GRANT ALL PRIVILEGES ON *.* TO 'admin'@'localhost' WITH GRANT OPTION;
 FLUSH PRIVILEGES;
-;
 EXIT;
 ```
 
-- Replace **MyStrongPassword!** with the password you want.
-- This reloads the privilege tables so your new password takes effect immediately.
+Replace `MyStrongPassword!` with a strong password of your choice.
+The `FLUSH PRIVILEGES;` statement reloads the in-memory privilege tables from disk, so the changes take effect immediately.
 
 ## Verify Access with New User
 
-You test logging into MySQL using the new user account to ensure it works and has the proper permissions. In my case new user is `admin`.
+Once you’ve created a new MySQL user, it’s critical to test login and confirm that the account works as expected. This ensures the account is properly configured and can authenticate against the MySQL server.
+
+Run the following command (for the `admin` user):
 
 ```console
 mysql -u admin -p
 ```
 
-- Enter your current `admin` password.
+You will then be asked to enter the password you created in the previous step.
 
-You should see output similar to the following:
+You should see output similar to:
 
 ```output
 Enter password:
@@ -136,4 +156,4 @@ Type 'help;' or '\h' for help. Type '\c' to clear the current input statement
 mysql> exit
 ```
-MySQL installation is complete. You can now proceed with the baseline testing of MySQL in the next section
+With this, the MySQL installation is complete. You can now proceed with baseline testing of MySQL in the next section.
 
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md
index 4d901c770b..da8b5b5949 100644
--- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/_index.md
@@ -7,22 +7,22 @@ cascade:
 
 minutes_to_complete: 60
 
-who_is_this_for: This Learning Path introduces ONNX deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating ONNX-based applications from x86_64 to Arm with minimal or no changes.
+who_is_this_for: This Learning Path introduces ONNX deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers deploying ONNX-based applications on Arm-based machines.
 
 learning_objectives:
     - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image.
    - Deploy ONNX on the Ubuntu Pro virtual machine.
-    - Perform ONNX baseline testing and benchmarking on both x86_64 and Arm64 virtual machines.
+    - Perform ONNX baseline testing and benchmarking on Arm64 virtual machines.
prerequisites: - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6). - Basic understanding of Python and machine learning concepts. - Familiarity with [ONNX Runtime](https://onnxruntime.ai/docs/) and Azure cloud services. -author: Jason Andrews +author: Pareena Verma ### Tags -skilllevels: Advanced +skilllevels: Introductory subjects: ML cloud_service_providers: Microsoft Azure diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md index 03ff40cd59..1aef38fe2f 100644 --- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md +++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/background.md @@ -6,7 +6,7 @@ weight: 2 layout: "learningpathall" --- -## Cobalt 100 Arm-based processor +## Azure Cobalt 100 Arm-based processor Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance. @@ -16,6 +16,6 @@ To learn more about Cobalt 100, refer to the blog [Announcing the preview of new ONNX (Open Neural Network Exchange) is an open-source format designed for representing machine learning models. It provides interoperability between different deep learning frameworks, enabling models trained in one framework (such as PyTorch or TensorFlow) to be deployed and run in another. -ONNX models are serialized into a standardized format that can be executed by the **ONNX Runtime**, a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators. This separation of model training and inference allows developers to build flexible, portable, and production-ready AI workflows. +ONNX models are serialized into a standardized format that can be executed by the ONNX Runtime, a high-performance inference engine optimized for CPU, GPU, and specialized hardware accelerators. This separation of model training and inference allows developers to build flexible, portable, and production-ready AI workflows. ONNX is widely used in cloud, edge, and mobile environments to deliver efficient and scalable inference for deep learning models. Learn more from the [ONNX official website](https://onnx.ai/) and the [ONNX Runtime documentation](https://onnxruntime.ai/docs/). diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md index 3e7ed69a1c..08f727beb4 100644 --- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md +++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/baseline.md @@ -7,12 +7,11 @@ layout: learningpathall --- -## Baseline testing using ONNX Runtime: +## Baseline Testing using ONNX Runtime: -This test measures the inference latency of the ONNX Runtime by timing how long it takes to process a single input using the `squeezenet-int8.onnx model`. 
It helps evaluate how efficiently the model runs on the target hardware.
-
-Create a **baseline.py** file with the below code for baseline test of ONNX:
+The purpose of this test is to measure the inference latency of ONNX Runtime on your Azure Cobalt 100 VM. By timing how long it takes to process a single input through the SqueezeNet INT8 model, you can validate that ONNX Runtime is functioning correctly and get a baseline performance measurement for your target hardware.
 
+Create a file named `baseline.py` with the following code:
 ```python
 import onnxruntime as ort
 import numpy as np
@@ -29,12 +28,12 @@ end = time.time()
 
 print("Inference time:", end - start)
 ```
 
-Run the baseline test:
+Run the baseline script to measure inference time:
 
 ```console
 python3 baseline.py
 ```
-You should see an output similar to:
+You should see output similar to:
 ```output
 Inference time: 0.0026061534881591797
 ```
@@ -45,8 +44,11 @@ input tensor of shape (1, 3, 224, 224):
 - 224 x 224: image resolution (common for models like SqueezeNet)
 {{% /notice %}}
 
+This indicates the model successfully executed a single forward pass through the SqueezeNet INT8 ONNX model and returned results.
+
 #### Output summary:
 
-- Single inference latency: ~2.60 milliseconds (0.00260 sec)
-- This shows the initial (cold-start) inference performance of ONNX Runtime on CPU using an optimized int8 quantized model.
-- This demonstrates that the setup is fully working, and ONNX Runtime efficiently executes quantized models on Arm64.
+Single inference latency (0.00260 sec): This is the time required for the model to process one input image and produce an output. The first run includes graph loading, memory allocation, and model initialization overhead.
+Subsequent inferences are usually faster due to caching and optimized execution.
+
+This demonstrates that the setup is fully working, and ONNX Runtime efficiently executes quantized models on Arm64.
 
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md
index 56d54578ae..d3a18d7050 100644
--- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/benchmarking.md
@@ -6,59 +6,63 @@ weight: 6
 layout: learningpathall
 ---
 
-Now that you’ve set up and run the ONNX model (e.g., SqueezeNet), you can use it to benchmark inference performance using Python-based timing or tools like **onnxruntime_perf_test**. This helps evaluate the ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances.
-
-You can also compare the inference time between Cobalt 100 (Arm64) and similar D-series x86_64-based virtual machine on Azure.
+Now that you have validated ONNX Runtime with Python-based timing (e.g., SqueezeNet baseline test), you can move to using a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models, allowing you to capture more detailed statistics than simple Python timing.
+This helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances.
 
 ## Run the performance tests using onnxruntime_perf_test
 
-The **onnxruntime_perf_test** is a performance benchmarking tool included in the ONNX Runtime source code.
It is used to measure the inference performance of ONNX models and supports multiple execution providers (such as CPU, GPU, and other accelerators). On Arm64 VMs, CPU execution is the focus.
 
 ### Install Required Build Tools
+Before building or running `onnxruntime_perf_test`, you will need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and handling model serialization via Protocol Buffers.
 
 ```console
 sudo apt update
 sudo apt install -y build-essential cmake git unzip pkg-config
 sudo apt install -y protobuf-compiler libprotobuf-dev libprotoc-dev git
 ```
-Then verify:
+Then verify the protobuf installation:
 
 ```console
 protoc --version
 ```
-You should see an output similar to:
+You should see output similar to:
 ```output
 libprotoc 3.21.12
 ```
 
 ### Build ONNX Runtime from Source:
 
-The benchmarking tool, **onnxruntime_perf_test**, isn’t available as a pre-built binary artifact for any platform. So, you have to build it from the source, which is expected to take around 40-50 minutes.
+The benchmarking tool `onnxruntime_perf_test` isn’t available as a pre-built binary for any platform, so you will have to build it from source, which is expected to take around 40 minutes.
 
-Clone onnxruntime:
+Clone the onnxruntime repository:
 
 ```console
 git clone --recursive https://github.com/microsoft/onnxruntime
 cd onnxruntime
 ```
-Now, build the benchmark as below:
+Now, build the benchmark tool:
 
 ```console
 ./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
 ```
 
-This will build the benchmark tool inside ./build/Linux/Release/onnxruntime_perf_test.
+You should see the executable at:
+```output
+./build/Linux/Release/onnxruntime_perf_test
+```
 
 ### Run the benchmark
 
-Now that the benchmarking tool has been built, you can benchmark the **squeezenet-int8.onnx** model, as below:
+Now that you have built the benchmarking tool, you can run inference benchmarks on the SqueezeNet INT8 model:
 
 ```console
-./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
```
-- **e cpu**: Use the CPU execution provider (not GPU or any other backend).
-- **r 100**: Run 100 inferences.
-- **m times**: Use "repeat N times" mode.
-- **s**: Show detailed statistics.
-- **Z**: Disable intra-op thread spinning (reduces CPU usage when idle between runs).
-- **I**: Input the ONNX model path without using input/output test data.
+Breakdown of the flags:
+- `-e cpu`: use the CPU execution provider.
+- `-r 100`: run 100 inference passes for statistical reliability.
+- `-m times`: run in "repeat N times" mode, which is useful for latency-focused measurement.
+- `-s`: show detailed per-run statistics (latency distribution).
+- `-Z`: disable intra-op thread spinning, which reduces CPU waste when idle between runs, especially on high-core-count systems like Cobalt 100.
+- `-I`: pass the ONNX model path directly, skipping pre-generated test data.
-You should see an output similar to: +You should see output similar to: ```output Disabling intra-op thread spinning between runs @@ -84,12 +88,12 @@ P999 Latency: 0.00190312 s ``` ### Benchmark Metrics Explained -- **Average Inference Time**: The mean time taken to process a single inference request across all runs. Lower values indicate faster model execution. -- **Throughput**: The number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently. -- **CPU Utilization**: The percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking. -- **Peak Memory Usage**: The maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments. -- **P50 Latency (Median Latency)**: The time below which 50% of inference requests complete. Represents typical latency under normal load. -- **Latency Consistency**: Describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter. + * Average Inference Time: The mean time taken to process a single inference request across all runs. Lower values indicate faster model execution. + * Throughput: The number of inference requests processed per second. Higher throughput reflects the model’s ability to handle larger workloads efficiently. + * CPU Utilization: The percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking. + * Peak Memory Usage: The maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments. + * P50 Latency (Median Latency): The time below which 50% of inference requests complete. Represents typical latency under normal load. + * Latency Consistency: Describes the stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter. ### Benchmark summary on Arm64: Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**. @@ -109,30 +113,12 @@ Here is a summary of benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pr | **Latency Consistency** | Consistent | -### Benchmark summary on x86 -Here is a summary of benchmark results collected on x86 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**. - -| **Metric** | **Value on Virtual Machine** | -|----------------------------|-------------------------------| -| **Average Inference Time** | 1.413 ms | -| **Throughput** | 707.48 inferences/sec | -| **CPU Utilization** | 100% | -| **Peak Memory Usage** | 38.80 MB | -| **P50 Latency** | 1.396 ms | -| **P90 Latency** | 1.501 ms | -| **P95 Latency** | 1.520 ms | -| **P99 Latency** | 1.794 ms | -| **P999 Latency** | 1.794 ms | -| **Max Latency** | 1.794 ms | -| **Latency Consistency** | Consistent | - - -### Highlights from Ubuntu Pro 24.04 Arm64 Benchmarking +### Highlights from Benchmarking on Azure Cobalt 100 Arm64 VMs -When comparing the results on Arm64 vs x86_64 virtual machines: -- **Low-Latency Inference:** Achieved consistent average inference times of ~1.86 ms on Arm64. -- **Strong and Stable Throughput:** Sustained throughput of over 538 inferences/sec using the `squeezenet-int8.onnx` model on D4ps_v6 instances. 
-- **Lightweight Resource Footprint:** Peak memory usage stayed below 37 MB, with CPU utilization around 96%, ideal for efficient edge or cloud inference.
-- **Consistent Performance:** P50, P95, and Max latency remained tightly bound, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure.
+The results on Arm64 virtual machines demonstrate:
+- Low-Latency Inference: Achieved consistent average inference times of ~1.86 ms on Arm64.
+- Strong and Stable Throughput: Sustained throughput of over 538 inferences/sec using the `squeezenet-int8.onnx` model on D4ps_v6 instances.
+- Lightweight Resource Footprint: Peak memory usage stayed below 37 MB, with CPU utilization around 96%, ideal for efficient edge or cloud inference.
+- Consistent Performance: P50, P95, and Max latency remained tightly bound, showcasing reliable performance on Azure Cobalt 100 Arm-based infrastructure.
-You have now benchmarked ONNX on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.
+You have now successfully benchmarked the inference time of an ONNX model on an Azure Cobalt 100 Arm64 virtual machine.
diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md
index 9571395aa2..420b6ea4b8 100644
--- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/create-instance.md
@@ -1,18 +1,24 @@
---
-title: Create an Arm based cloud virtual machine using Microsoft Cobalt 100 CPU
+title: Create an Arm-based Azure VM with Cobalt 100
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-## Introduction
+## Set up your development environment

-There are several ways to create an Arm-based Cobalt 100 virtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). This guide will use the Azure console to create a virtual machine with Arm-based Cobalt 100 Processor.
+There is more than one way to create an Arm-based Cobalt 100 virtual machine:

-This learning path focuses on the general-purpose virtual machine of the D series. Please read the guide on [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series) offered by Microsoft Azure.
+- The Microsoft Azure portal
+- The Azure CLI
+- Your preferred infrastructure as code (IaC) tool

-If you have never used the Microsoft Cloud Platform before, please review the microsoft [guide to Create a Linux virtual machine in the Azure portal](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal?tabs=ubuntu).
+In this Learning Path, you will use the Azure portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor.
+
+You will focus on the general-purpose virtual machines in the D-series. For further information, see the Microsoft Azure guide for the [Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series).
+
+The steps to create this instance are included here for convenience. For more information about setting up Cobalt 100 on Azure, see the [Deploy a Cobalt 100 virtual machine on Azure](/learning-paths/servers-and-cloud-computing/cobalt/) Learning Path.
#### Create an Arm-based Azure Virtual Machine diff --git a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md index 971777eb11..ed9ff8e35e 100644 --- a/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md +++ b/content/learning-paths/servers-and-cloud-computing/onnx-on-azure/deploy.md @@ -8,13 +8,13 @@ layout: learningpathall ## ONNX Installation on Azure Ubuntu Pro 24.04 LTS -Install Python, create a virtual environment, and use pip to install ONNX, ONNX Runtime, and dependencies. Verify the setup and validate a sample ONNX model like SqueezeNet. +To work with ONNX models on Azure, you will need a clean Python environment with the required packages. The following steps install Python, set up a virtual environment, and prepare for ONNX model execution using ONNX Runtime. ### Install Python and Virtual Environment: ```console sudo apt update -sudo apt install -y python3 python3-pip python3-virtualenv +sudo apt install -y python3 python3-pip python3-virtualenv python3-venv ``` Create and activate a virtual environment: @@ -26,6 +26,7 @@ source onnx-env/bin/activate ### Install ONNX and Required Libraries: +Upgrade pip and install ONNX with its runtime and supporting libraries: ```console pip install --upgrade pip pip install onnx onnxruntime fastapi uvicorn numpy @@ -33,34 +34,38 @@ pip install onnx onnxruntime fastapi uvicorn numpy This installs ONNX libraries along with FastAPI (web serving) and NumPy (for input tensor generation). ### Validate ONNX and ONNX Runtime: -Create **version.py** as below: +Once the libraries are installed, you should verify that both ONNX and ONNX Runtime are correctly set up on your VM. +Create a file named `version.py` with the following code: ```python import onnx import onnxruntime -print("ONNX version:", onnx.version) +print("ONNX version:", onnx.__version__) print("ONNX Runtime version:", onnxruntime.__version__) ``` -Now, run version.py: +Run the script: ```console python3 version.py ``` -You should see an output similar to: +You should see output similar to: ```output -ONNX version: 1.18.0 -ONNX Runtime version: 1.22.0 +ONNX version: 1.19.0 +ONNX Runtime version: 1.23.0 ``` -### Download and Validate ONNX Model - SqueezeNet: -SqueezeNet is a lightweight convolutional neural network (CNN) architecture designed to achieve comparable accuracy to AlexNet, but with fewer parameters and smaller model size. +With this validation, you have confirmed that ONNX and ONNX Runtime are installed and ready on your Azure Cobalt 100 VM. This is the foundation for running inference workloads and serving ONNX models. +### Download and Validate ONNX Model - SqueezeNet: +SqueezeNet is a lightweight convolutional neural network (CNN) architecture designed to provide accuracy close to AlexNet while using 50x fewer parameters and a much smaller model size. This makes it well-suited for benchmarking ONNX Runtime. +Download the quantized model: ```console wget https://github.com/onnx/models/raw/main/validated/vision/classification/squeezenet/model/squeezenet1.0-12-int8.onnx -O squeezenet-int8.onnx ``` #### Validate the model: -Create a **vaildation.py** file with the code below for validation for ONNX model: +After downloading the SqueezeNet ONNX model, the next step is to confirm that it is structurally valid and compliant with the ONNX specification. ONNX provides a built-in checker utility that verifies the graph, operators, and metadata. 
+Create a file named `validation.py` with the following code:
```python
import onnx
@@ -69,10 +74,16 @@ model = onnx.load("squeezenet-int8.onnx")
onnx.checker.check_model(model)
print("✅ Model is valid!")
```
-You should see an output similar to:
+Run the script:
+
+```console
+python3 validation.py
+```
+
+You should see output similar to:
```output
✅ Model is valid!
```
-This downloads a quantized (INT8) classification model, and validates its structure using ONNX’s built-in checker.
+With this validation, you have confirmed that the quantized SqueezeNet model is valid and ONNX-compliant. The next step is to run inference with ONNX Runtime and to benchmark performance.

ONNX installation and model validation are complete. You can now proceed with the baseline testing.
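+
+Before running the baseline test, you can optionally inspect the model's input and output signatures with ONNX Runtime. This is an illustrative sketch rather than a required step, and it assumes `squeezenet-int8.onnx` is in the current directory:
+
+```python
+import onnxruntime as ort
+
+# Load the model with the CPU execution provider and print its I/O metadata
+session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])
+
+for model_input in session.get_inputs():
+    print("Input:", model_input.name, model_input.shape, model_input.type)
+for model_output in session.get_outputs():
+    print("Output:", model_output.name, model_output.shape, model_output.type)
+```
+
+The reported input shape should correspond to the (1, 3, 224, 224) tensor used in the baseline test.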