
Commit de650a2

Merge pull request #2489 from jasonrandrews/review
review llama.cpp on GB10
2 parents 3ec77f9 + 137a06f commit de650a2

File tree

5 files changed: +193 -186 lines

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md

Lines changed: 61 additions & 57 deletions
````diff
@@ -1,67 +1,67 @@
 ---
-title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference
+title: Verify Grace Blackwell system readiness for AI inference
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Introduction to Grace–Blackwell Architecture
+## Introduction to Grace Blackwell architecture
+
+In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads.

-In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform — a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads.
 You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions.

-The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer’s desk.
-At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, a breakthrough architecture that fuses CPU and GPU into a single, unified compute engine.
+The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop.
+The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.

-The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines:
-- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
+The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines:
+- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.

-- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
+- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
 - A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks.

-This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads.
-
-DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere — empowering developers to prototype, fine-tune, and deploy large-scale AI models locally, while seamlessly connecting to the cloud or data center environments when needed.
+This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision.
+DGX Spark is a compact yet powerful development platform for modern AI workloads.

-More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai).
+DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed.

+### Why Grace Blackwell for quantized LLMs?

-### Why Grace–Blackwell for Quantized LLMs?
-
-Quantized Large Language Models (LLMs) — such as those using Q4, Q5, or Q8 precision — benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.

 | **Feature** | **Impact on Quantized LLMs** |
 |--------------|------------------------------|
-| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
-| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
-| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
-| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
-| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |
+| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
+| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
+| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
+| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
+| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |


 In a typical quantized LLM workflow:
 - The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
 - The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
 - Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference.

-Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
+Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.


-### Inspecting Your GB10 Environment
+### Inspecting your GB10 environment

-Let’s confirm that your environment is ready for the sessions ahead.
+Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs.

 #### Step 1: Check CPU information

-Run the following commands to confirm CPU readiness:
+Run the following command to print the CPU information:

 ```bash
 lscpu
 ```

 Expected output:
-```log
+
+```output
 Architecture: aarch64
 CPU op-mode(s): 64-bit
 Byte Order: Little Endian
````
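The full `lscpu` listing is long. If you only want to confirm the architecture and core count before moving on, you can filter it; a minimal sketch using the standard Ubuntu tools (on GB10 the expected values are `aarch64` and 20 cores):

```bash
# Confirm the architecture and core count reported by lscpu (expect aarch64 and 20).
lscpu | grep -E '^(Architecture|CPU\(s\))'
nproc   # total usable cores
```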
````diff
@@ -125,16 +125,16 @@ The following table summarizes the key specifications of the Grace CPU and expla

 | **Category** | **Specification** | **Description / Impact for LLM Inference** |
 |---------------|-------------------|---------------------------------------------|
-| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
-| **Core Configuration** | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
-| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. |
-| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
-| **Cache Hierarchy** | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
-| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
-| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
-| **Security & Reliability** | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |
+| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
+| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
+| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. |
+| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
+| Cache Hierarchy | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
+| Instruction Set Features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
+| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
+| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |

-Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make it ideal for **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
+Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing a power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.

 You can also verify the operating system running on your DGX Spark by using the following command:

````
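To confirm the SVE2, BF16, and I8MM capabilities called out above, you can check the CPU feature flags directly; a minimal sketch (Linux reports these as the lowercase flags `sve2`, `bf16`, and `i8mm`):

```bash
# Print only the quantization-relevant feature flags advertised by the kernel.
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E -x 'sve|sve2|bf16|i8mm' | sort -u
```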

````diff
@@ -143,27 +143,28 @@ lsb_release -a
 ```

 Expected output:
+
 ```log
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 24.04.3 LTS
 Release: 24.04
 Codename: noble
 ```
-As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution.
+As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution.
 It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads.

+#### Step 2: Verify Blackwell GPU and driver

-#### Step 2: Verify Blackwell GPU and Driver
-
-After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads.
+After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads.

 ```bash
 nvidia-smi
 ```

-Expected output:
-```log
+You will see output similar to:
+
+```output
 Wed Oct 22 09:26:54 2025
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
````
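For a script-friendly alternative to the full `nvidia-smi` table, the query interface prints just the fields you care about; a minimal sketch (field support varies by platform, and some values on GB10 are reported as Not Supported):

```bash
# Print GPU name, driver version, idle utilization, and temperature as CSV.
nvidia-smi --query-gpu=name,driver_version,utilization.gpu,temperature.gpu \
           --format=csv,noheader
```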
````diff
@@ -187,34 +188,37 @@ Wed Oct 22 09:26:54 2025
 +-----------------------------------------------------------------------------------------+
 ```

-The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information — including driver status, temperature, power usage, and GPU utilization — which helps verify that the system is stable and ready for AI workloads.
+The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
+
+The table below provides more explanation of the `nvidia-smi` output:

-Understanding the Output of nvidia-smi
 | **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
 |---------------|--------------------------------------|---------------------------------------------|
-| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
-| **Driver Version** | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
-| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
-| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
-| **Memory** | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
-| **Power & Thermal Status** | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
-| **GPU-Utilization** | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. |
-| **Memory Usage** | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
-| **Persistence Mode** | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
+| GPU Name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
+| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
+| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
+| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
+| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
+| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
+| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. |
+| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
+| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. |


 #### Step 3: Check CUDA Toolkit

-To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed.
-The command ***nvcc --version*** confirms that the CUDA compiler is available and compatible with CUDA 13.
+To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
+
+The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
 This ensures that CMake can correctly detect and compile the GPU-accelerated components.

 ```bash
 nvcc --version
 ```

-Expected output:
-```log
+You will see output similar to:
+
+```output
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2025 NVIDIA Corporation
 Built on Wed_Aug_20_01:57:39_PM_PDT_2025
````
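Beyond printing the compiler version, you can confirm the toolchain and GPU work end to end by compiling and running a trivial device query; a minimal sketch, assuming `nvcc` is on your PATH and using `sm_121` to match the compute capability reported above (adjust if your toolkit reports a different target):

```bash
# Build and run a tiny CUDA program that queries the first device.
cat > check_gpu.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device visible\n");
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    std::printf("%s: compute capability %d.%d, %zu MiB memory\n",
                prop.name, prop.major, prop.minor, prop.totalGlobalMem >> 20);
    return 0;
}
EOF
nvcc -arch=sm_121 check_gpu.cu -o check_gpu && ./check_gpu
```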
````diff
@@ -223,7 +227,7 @@ Build cuda_13.0.r13.0/compiler.36424714_0
 ```

 {{% notice Note %}}
-In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
+The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
 {{% /notice %}}

 This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
````
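Since inference binaries need only the CUDA runtime and driver libraries, not nvcc, you can separately confirm that those libraries are visible to the dynamic loader; a minimal sketch:

```bash
# libcuda comes from the driver package; libcudart from the CUDA toolkit.
ldconfig -p | grep -E 'libcuda\.so|libcudart\.so'
```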
````diff
@@ -234,4 +238,4 @@ At this point, you have verified that:
 - The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
 - The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.

-Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform.
+Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.
````
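For orientation before that build section, the CUDA-enabled llama.cpp configure step generally follows the pattern below; a representative sketch, assuming the current upstream repository location and the `GGML_CUDA` CMake option (verify both against the build instructions in the next section):

```bash
# Fetch llama.cpp and configure a CUDA build (option names can change between releases).
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build -DGGML_CUDA=ON
cmake --build build -j"$(nproc)"
```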
