diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md new file mode 100644 index 0000000000..fc44403c42 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -0,0 +1,237 @@ +--- +title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction to Grace–Blackwell Architecture + +In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform — a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads. +You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions. + +The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer’s desk. +At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, a breakthrough architecture that fuses CPU and GPU into a single, unified compute engine. + +The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines: +- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency. + +- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads. +- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks. + +This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads. + +DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere — empowering developers to prototype, fine-tune, and deploy large-scale AI models locally, while seamlessly connecting to the cloud or data center environments when needed. + +More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai). + + +### Why Grace–Blackwell for Quantized LLMs? + +Quantized Large Language Models (LLMs) — such as those using Q4, Q5, or Q8 precision — benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip. + +| **Feature** | **Impact on Quantized LLMs** | +|--------------|------------------------------| +| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). | +| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. | +| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. | +| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. 
| +| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. | + + +In a typical quantized LLM workflow: +- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks. +- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput. +- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference. + +Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor. + + +### Inspecting Your GB10 Environment + +Let’s confirm that your environment is ready for the sessions ahead. + +#### Step 1: Check CPU information + +Run the following commands to confirm CPU readiness: + +```bash +lscpu +``` + +Expected output: +```log +Architecture: aarch64 + CPU op-mode(s): 64-bit + Byte Order: Little Endian +CPU(s): 20 + On-line CPU(s) list: 0-19 +Vendor ID: ARM + Model name: Cortex-X925 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 89% + CPU max MHz: 4004.0000 + CPU min MHz: 1378.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt + Model name: Cortex-A725 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 99% + CPU max MHz: 2860.0000 + CPU min MHz: 338.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt +Caches (sum of all): + L1d: 1.3 MiB (20 instances) + L1i: 1.3 MiB (20 instances) + L2: 25 MiB (20 instances) + L3: 24 MiB (2 instances) +NUMA: + NUMA node(s): 1 + NUMA node0 CPU(s): 0-19 +Vulnerabilities: + Gather data sampling: Not affected + Itlb multihit: Not affected + L1tf: Not affected + Mds: Not affected + Meltdown: Not affected + Mmio stale data: Not affected + Reg file data sampling: Not affected + Retbleed: Not affected + Spec rstack overflow: Not affected + Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl + Spectre v1: Mitigation; __user pointer sanitization + Spectre v2: Not affected + Srbds: Not affected + Tsx async abort: Not affected +``` + +The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. + +The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. + +| **Category** | **Specification** | **Description / Impact for LLM Inference** | +|---------------|-------------------|---------------------------------------------| +| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. 
| +| **Core Configuration** | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. | +| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. | +| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. | +| **Cache Hierarchy** | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. | +| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. | +| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. | +| **Security & Reliability** | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. | + +Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make it ideal for **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. + +You can also verify the operating system running on your DGX Spark by using the following command: + +```bash +lsb_release -a +``` + +Expected output: +```log +No LSB modules are available. +Distributor ID: Ubuntu +Description: Ubuntu 24.04.3 LTS +Release: 24.04 +Codename: noble +``` +As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution. +It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. + + +#### Step 2: Verify Blackwell GPU and Driver + +After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads. + +```bash +nvidia-smi +``` + +Expected output: +```log +Wed Oct 22 09:26:54 2025 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A | +| N/A 32C P8 4W / N/A | Not Supported | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB | +| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB | ++-----------------------------------------------------------------------------------------+ +``` + +The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information — including driver status, temperature, power usage, and GPU utilization — which helps verify that the system is stable and ready for AI workloads. + +Understanding the Output of nvidia-smi +| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** | +|---------------|--------------------------------------|---------------------------------------------| +| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. 
| +| **Driver Version** | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. | +| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. | +| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. | +| **Memory** | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. | +| **Power & Thermal Status** | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. | +| **GPU-Utilization** | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. | +| **Memory Usage** | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | +| **Persistence Mode** | On | Ensures the GPU remains initialized and ready for rapid inference startup. | + + +#### Step 3: Check CUDA Toolkit + +To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed. +The command ***nvcc --version*** confirms that the CUDA compiler is available and compatible with CUDA 13. +This ensures that CMake can correctly detect and compile the GPU-accelerated components. + +```bash +nvcc --version +``` + +Expected output: +```log +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Aug_20_01:57:39_PM_PDT_2025 +Cuda compilation tools, release 13.0, V13.0.88 +Build cuda_13.0.r13.0/compiler.36424714_0 +``` + +{{% notice Note %}} +In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference. +{{% /notice %}} + +This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation. +If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121). + +At this point, you have verified that: +- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions. +- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime. +- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. + +Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md new file mode 100644 index 0000000000..fa6212fd6d --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -0,0 +1,186 @@ +--- +title: Building the GPU Version of llama.cpp on GB10 +weight: 3 +layout: "learningpathall" +--- + +## Building GPU Version of llama.cpp on GB10 + +In the previous session, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. 
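+
+If you are returning to the system in a new shell session, you can quickly re-confirm that the GPU driver and the CUDA toolchain are visible before building. This is a minimal sanity check, and the exact versions reported on your system may differ from the examples in this Learning Path:
+
+```bash
+# Report only the GPU name and driver version
+nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
+
+# Confirm which CUDA compiler release is on the PATH
+nvcc --version | grep release
+```
+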
+Now that your hardware and drivers are ready, this session focuses on building the GPU-enabled version of **llama.cpp** — a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
+
+[llama.cpp](https://github.com/ggml-org/llama.cpp) is an open-source project by Georgi Gerganov that provides efficient, dependency-free large language model inference on both CPUs and GPUs. For an introduction to llama.cpp on Arm, you can also refer to this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro/).
+
+### Step 1: Preparation
+
+In this step, you will install the necessary build tools and download a small quantized model for validation.
+
+```bash
+sudo apt update
+sudo apt install -y git cmake build-essential nvtop htop
+```
+
+These packages provide the C/C++ compiler toolchain, the CMake build system, and the monitoring utilities (`nvtop`, `htop`) required to compile and test llama.cpp.
+
+To verify your GPU build later, you need at least one quantized model for testing.
+First, create a Python virtual environment, install the latest Hugging Face Hub CLI, and download the model:
+
+```bash
+mkdir -p ~/models
+cd ~/models
+
+python3 -m venv venv
+source venv/bin/activate
+pip install -U huggingface_hub
+
+hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B
+```
+
+After the download completes, the model files will be available in the `~/models/TinyLlama-1.1B` directory.
+
+### Step 2: Clone the llama.cpp Repository
+
+In this step, you will download the source code for llama.cpp from GitHub.
+
+```bash
+cd ~
+git clone https://github.com/ggerganov/llama.cpp.git
+cd ~/llama.cpp
+```
+
+### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
+
+Run the following `cmake` command to configure the build system for GPU acceleration.
+This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.
+
+```bash
+mkdir -p build-gpu
+cd build-gpu
+
+cmake .. \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DGGML_CUDA=ON \
+  -DGGML_CUDA_F16=ON \
+  -DCMAKE_CUDA_ARCHITECTURES=121 \
+  -DCMAKE_C_COMPILER=gcc \
+  -DCMAKE_CXX_COMPILER=g++ \
+  -DCMAKE_CUDA_COMPILER=nvcc
+```
+
+Explanation of key flags:
+
+| **Feature** | **Description / Impact** |
+|--------------|------------------------------|
+| **-DGGML_CUDA=ON** | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration. |
+| **-DGGML_CUDA_F16=ON** | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (e.g., Q4, Q5). |
+| **-DCMAKE_CUDA_ARCHITECTURES=121** | Specifies the compute capability of the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels. |
+
+When the configuration process completes successfully, the terminal should display output similar to the following:
+
+```
+-- Configuring done (2.0s)
+-- Generating done (0.1s)
+-- Build files have been written to: /home/nvidia/llama.cpp/build-gpu
+```
+
+{{% notice Note %}}
+1. For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct **CUDA 13.0 toolchain**.
+2. If configuration errors occur, revisit **Session 1** to verify that your CUDA toolkit and driver versions are properly installed and aligned with **Blackwell (sm_121)** support.
+{{% /notice %}}
+
+Once CMake configuration succeeds, start the compilation process:
+
+```bash
+make -j"$(nproc)"
+```
+
+This command compiles all CUDA and C++ source files in parallel using all available CPU cores.
+Thanks to the high-performance Grace CPU and unified memory subsystem, the build on the DGX Spark (GB10) typically completes within 2–4 minutes.
+
+Example build output:
+```
+[  0%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
+[ 15%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o
+[ 50%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
+[100%] Built target test-backend-ops
+[100%] Linking CXX executable ../../bin/llama-server
+[100%] Built target llama-server
+```
+
+After the build completes, the GPU-accelerated binaries will be located under `~/llama.cpp/build-gpu/bin/`.
+
+These binaries provide all the tools needed for quantized model inference (llama-cli) and for serving GPU inference via an HTTP API (llama-server). Together, the configuration options above ensure that the build targets the Grace–Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
+
+### Step 4: Validate the CUDA-Enabled Build (GPU Mode)
+
+After the build completes successfully, verify that the GPU-enabled binary of ***llama.cpp*** is correctly linked to the NVIDIA CUDA runtime.
+
+To verify CUDA linkage, run the following command:
+
+```bash
+ldd bin/llama-cli | grep cuda
+```
+
+Expected output:
+```
+  libggml-cuda.so => /home/nvidia/llama.cpp/build-gpu/bin/libggml-cuda.so (0x0000eee1e8e30000)
+  libcudart.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13 (0x0000eee1e83b0000)
+  libcublas.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13 (0x0000eee1e4860000)
+  libcuda.so.1 => /lib/aarch64-linux-gnu/libcuda.so.1 (0x0000eee1debd0000)
+  libcublasLt.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13 (0x0000eee1b36c0000)
+```
+
+If the CUDA libraries are correctly linked, the binary can access the GPU through the system driver.
+
+Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability:
+
+```bash
+./bin/llama-server --version
+```
+
+Expected output:
+```
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
+version: 6819 (19a5a3ed)
+built with gcc (Ubuntu 12.4.0-2ubuntu1~24.04) 12.4.0 for aarch64-linux-gnu
+```
+
+The message "compute capability 12.1" confirms that the build was compiled specifically for the Blackwell GPU (sm_121) and that CUDA 13.0 is functioning correctly.
+
+Next, use the pre-downloaded quantized model (for example, TinyLlama-1.1B) to verify that inference executes successfully on the GPU:
+
+```bash
+./bin/llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 32 \
+  -t 16 \
+  -p "Explain the advantages of the Armv9 architecture."
+```
+
+If the build is successful, you will see text generation begin within a few seconds.
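+
+Optionally, you can get a quick throughput estimate with `llama-bench`, which is built alongside `llama-cli` in the same `bin/` directory. The invocation below is a minimal sketch; the reported prompt-processing (pp) and token-generation (tg) rates will vary with the model and settings you choose:
+
+```bash
+# Offload all layers to the GPU and benchmark prompt processing and generation
+./bin/llama-bench \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 99 \
+  -p 512 \
+  -n 128
+```
+
+Substantially higher tokens-per-second here than with the CPU-only build you will create in the next session indicates that the transformer layers are executing on the Blackwell GPU.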
+ +While `nvidia-smi` can display basic GPU information, `nvtop` provides real-time visualization of utilization, temperature, and power metrics — useful for verifying CUDA kernel activity during inference. + +```bash +nvtop +``` + +The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark. + +![image1 nvtop screenshot](nvtop.png "TinyLlama GPU Utilization") + +The nvtop interface shows: +- GPU Utilization (%) : confirm CUDA kernels are active +- Memory Usage (VRAM) : observe model loading and runtime footprint +- Temperature / Power Draw : monitor thermal stability under sustained workloads + +You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark. +In the next session, you will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md new file mode 100644 index 0000000000..9d9d3fc93d --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -0,0 +1,139 @@ +--- +title: Building the CPU Version of llama.cpp on GB10 +weight: 4 +layout: "learningpathall" +--- + +## Building CPU Version of llama.cpp on GB10 + +### Step 1: Configure and Build the CPU-Only Version + +In this session, you will configure and build the CPU-only version of **llama.cpp**, optimized for the **Armv9**-based Grace CPU. + +This build runs entirely on the **Grace CPU (Arm Cortex-X925 and Cortex-A725)**, which supports advanced Armv9 vector extensions such as **SVE2**, **BF16**, and **I8MM**, making it highly efficient for quantized inference workloads even without GPU acceleration. + +Start from a clean directory to ensure a clean separation from the GPU build artifacts. +Run the following commands to configure the build system for the CPU-only version of llama.cpp. + +```bash +cd ~/llama.cpp +mkdir -p build-cpu +cd build-cpu + +cmake .. 
\
+  -DCMAKE_BUILD_TYPE=Release \
+  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
+  -DLLAMA_ACCELERATE=ON \
+  -DLLAMA_BLAS=OFF \
+  -DCMAKE_C_COMPILER=gcc \
+  -DCMAKE_CXX_COMPILER=g++ \
+  -DCMAKE_C_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp" \
+  -DCMAKE_CXX_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp"
+```
+
+Explanation of key flags:
+
+| **Feature** | **Description / Impact** |
+|--------------|------------------------------|
+| **-march=armv9-a** | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions. |
+| **+sve2+bf16+i8mm** | Activates Scalable Vector Extension 2 (SVE2), BFloat16, and INT8 matrix multiply (I8MM) instructions for quantized inference. |
+| **-fopenmp** | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized. |
+| **-mtune=native** | Optimizes code generation for the local Grace CPU microarchitecture. |
+| **-DLLAMA_ACCELERATE=ON** | Enables llama.cpp’s internal ARM acceleration path (Neon/SVE optimized kernels). |
+
+When the configuration process completes successfully, the terminal should display output similar to the following:
+
+```
+-- Configuring done (1.1s)
+-- Generating done (0.1s)
+-- Build files have been written to: /home/nvidia/llama.cpp/build-cpu
+```
+
+Then, start the compilation process:
+
+```bash
+make -j"$(nproc)"
+```
+
+{{% notice Note %}}
+If the build fails after modifying optimization flags, it is likely due to a stale CMake cache.
+Run the following commands to perform a clean reconfiguration:
+```bash
+cmake --fresh .
+make -j"$(nproc)"
+```
+{{% /notice %}}
+
+The CPU build on the DGX Spark (GB10) completes in about 20 seconds — even faster than the GPU build.
+
+Example build output:
+```
+[ 25%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
+[ 50%] Linking CXX executable ../bin/test-tokenizer-0
+[ 75%] Linking CXX executable ../bin/test-alloc
+[100%] Linking CXX executable ../../bin/llama-server
+[100%] Built target llama-server
+```
+
+After the build finishes, the CPU-optimized binaries will be available under `~/llama.cpp/build-cpu/bin/`.
+
+### Step 2: Validate the CPU-Only Build (CPU Mode)
+
+In this step, you will validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU.
+
+```bash
+./bin/llama-server --version
+```
+
+Expected output:
+```
+version: 6819 (19a5a3ed)
+built with gcc (Ubuntu 12.4.0-2ubuntu1~24.04) 12.4.0 for aarch64-linux-gnu
+```
+
+Unlike the GPU build, the version banner contains no `ggml_cuda_init` lines, which confirms that this is a CPU-only binary optimized for the Grace CPU.
+
+Next, use the pre-downloaded quantized model (for example, TinyLlama-1.1B) to verify that inference executes successfully on the CPU:
+
+```bash
+./bin/llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 0 \
+  -t 20 \
+  -p "Explain the advantages of the Armv9 architecture."
+```
+
+- ***-ngl 0*** : Disables GPU offloading (CPU-only execution).
+- ***-t 20*** : Uses 20 threads (one per Grace CPU core).
+
+If the build is successful, you will observe smooth model initialization and token generation, with CPU utilization increasing across all cores.
+
+For live CPU utilization and thread activity, use `htop` instead of `nvtop`:
+
+```bash
+htop
+```
+
+The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
+![image2 htop screenshot](htop.png "TinyLlama CPU Utilization") + +The `htop` interface shows: + +- **CPU Utilization**: All 20 cores operate between 75–85%, confirming efficient multi-thread scaling. +- **Load Average**: Around 5.0, indicating balanced workload distribution. +- **Memory Usage**: Approximately 4.5 GB total for the TinyLlama Q8_0 model. +- **Process List**: Displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism + +{{% notice Note %}} +In htop, press **`F6`** to sort by CPU% and verify load distribution, or press **`t`** to toggle the **tree view**, which clearly shows the `llama-cli` main process and its worker threads. +{{% /notice %}} + +In this session, you have: +- Built and validated the CPU-only version of llama.cpp. +- Optimized the Grace CPU build using Armv9 vector extensions (SVE2, BF16, I8MM). +- Tested quantized model inference using the TinyLlama Q8_0 model. +- Used monitoring tools (htop) to confirm efficient CPU utilization. + + +You have now successfully built and validated the CPU-only version of ***llama.cpp*** on the Grace CPU. +In the next session, you will learn how to use ***processwatch*** to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md new file mode 100644 index 0000000000..4406bcbfab --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md @@ -0,0 +1,212 @@ +--- +title: Analyzing Instruction Mix on Grace CPU using Process Watch +weight: 5 +layout: "learningpathall" +--- + +## Analyzing Instruction Mix on the Grace CPU Using Process Watch + +In this session, you will explore how the **Grace CPU** executes Armv9 vector and matrix instructions during quantized LLM inference. +By using **Process Watch**, you will observe how Neon SIMD instructions dominate execution on the Grace CPU and learn why SVE and SVE2 remain inactive under the current kernel configuration. +This exercise demonstrates how Armv9 vector execution behaves in real AI workloads and how hardware capabilities evolve—from traditional SIMD pipelines to scalable vector and matrix computation. + +### Step 1: Observe SIMD Execution with Process Watch + +Start by running a quantized model on the Grace CPU: + +In this step, you will install and configure **Process Watch**, an instruction-level profiling tool that shows live CPU instruction usage across threads. It supports real-time visualization of **NEON**, **SVE**, **FP**, and other vector and scalar instructions executed on Armv9 processors. + +```bash +sudo apt update +sudo apt install -y git cmake build-essential libncurses-dev libtinfo-dev +``` + +Use the following commands to download the source code, compile it, and install the binary into the processwatch directory. + +```bash +# Clone and build Process Watch +cd ~ +git clone --recursive https://github.com/intel/processwatch.git +cd processwatch +./build.sh +sudo ln -s ~/processwatch/processwatch /usr/local/bin/processwatch +``` + +To collect instruction-level metrics, ***Process Watch*** requires access to kernel performance counters and eBPF features. +Although it can run as a non-root user, full functionality requires elevated privileges. For simplicity and completeness, run it with administrative rights. 
+
+Run the following commands to enable the required permissions:
+```bash
+sudo setcap CAP_PERFMON,CAP_BPF=+ep ./processwatch
+sudo sysctl -w kernel.perf_event_paranoid=-1
+sudo sysctl kernel.unprivileged_bpf_disabled=0
+```
+
+These commands:
+- Grant Process Watch the ability to use performance monitoring (perf) and eBPF tracing.
+- Lower kernel restrictions on accessing performance counters.
+- Allow unprivileged users to attach performance monitors.
+
+Verify the installation:
+
+```bash
+./processwatch --help
+```
+
+You should see a usage summary similar to:
+```
+usage: processwatch [options]
+options:
+  -h          Displays this help message.
+  -v          Displays the version.
+  -i <seconds>  Prints results every <seconds> seconds.
+  -n <num>      Prints results for <num> intervals.
+  -c          Prints all results in CSV format to stdout.
+  -p <pid>      Only profiles <pid>.
+  -m          Displays instruction mnemonics, instead of categories.
+  -s <period>   Profiles instructions with a sampling period of <period>. Defaults to 100000 instructions (1 in 100000 instructions).
+  -f <filter>   Can be used multiple times. Defines filters for columns. Defaults to 'FPARMv8', 'NEON', 'SVE' and 'SVE2'.
+  -a          Displays a column for each category, mnemonic, or extension. This is a lot of output!
+  -l          Prints a list of all available categories, mnemonics, or extensions.
+  -d          Prints only debug information.
+```
+
+In this step, you will run a quantized TinyLlama model on the Grace CPU to generate live instruction activity.
+
+Use the same CPU-only llama.cpp build created in the previous session:
+
+```bash
+cd ~/llama.cpp/build-cpu/bin
+./llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 0 \
+  -t 20 \
+  -p "Explain the benefits of vector processing in modern Arm CPUs."
+```
+
+Keep this terminal running while the model generates text output.
+You will now attach Process Watch to this active process.
+
+Once the llama.cpp process is running on the Grace CPU, attach Process Watch to observe its live instruction activity.
+If only one ***llama-cli*** process is running, you can launch Process Watch without looking up its PID manually:
+
+```bash
+sudo processwatch --pid $(pgrep llama-cli)
+```
+
+Here, `pgrep` resolves the PID of the running llama-cli process, and Process Watch attaches to it directly.
+
+If multiple instances of llama-cli or other workloads are active, first list the matching process IDs:
+
+```bash
+pgrep llama-cli
+```
+
+Then attach Process Watch to the process you want to monitor:
+
+```bash
+sudo processwatch --pid <PID>
+```
+{{% notice Note %}}
+The Process Watch `-l` (list) option prints the available instruction categories, mnemonics, and extensions; it does not list running processes.
+Use `pgrep`, `ps -ef | grep llama`, or `htop` to identify process IDs before attaching.
+{{% /notice %}} + +The tool will display a live instruction breakdown similar to the following: +``` +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 5.07 15.23 0.00 0.00 100.00 29272 +72930 llama-cli 5.07 15.23 0.00 0.00 100.00 29272 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.57 9.95 0.00 0.00 100.00 69765 +72930 llama-cli 2.57 9.95 0.00 0.00 100.00 69765 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 1.90 6.61 0.00 0.00 100.00 44249 +72930 llama-cli 1.90 6.61 0.00 0.00 100.00 44249 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.60 10.16 0.00 0.00 100.00 71049 +72930 llama-cli 2.60 10.16 0.00 0.00 100.00 71049 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.12 7.56 0.00 0.00 100.00 68553 +72930 llama-cli 2.12 7.56 0.00 0.00 100.00 68553 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.52 9.40 0.00 0.00 100.00 65339 +72930 llama-cli 2.52 9.40 0.00 0.00 100.00 65339 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.34 7.76 0.00 0.00 100.00 42015 +72930 llama-cli 2.34 7.76 0.00 0.00 100.00 42015 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.66 9.77 0.00 0.00 100.00 74616 +72930 llama-cli 2.66 9.77 0.00 0.00 100.00 74616 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.15 7.06 0.00 0.00 100.00 58496 +72930 llama-cli 2.15 7.06 0.00 0.00 100.00 58496 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.61 9.34 0.00 0.00 100.00 73365 +72930 llama-cli 2.61 9.34 0.00 0.00 100.00 73365 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.52 8.37 0.00 0.00 100.00 26566 +72930 llama-cli 2.52 8.37 0.00 0.00 100.00 26566 +``` + +Interpretation: +- NEON (≈ 7–15 %) : Active SIMD integer and floating-point operations. +- FPARMv8 : Scalar FP operations (e.g., activation, normalization). +- SVE/SVE2 = 0 : The kernel is restricted to 128-bit vectors and does not issue SVE instructions. + +This confirms that the Grace CPU performs quantized inference primarily using Neon SIMD pipelines. + + +### Step 2: Why SVE and SVE2 Remain Inactive + +Although the Grace CPU supports SVE and SVE2, the current NVIDIA Grace kernel limits the default vector length to 16 bytes (128-bit). +This restriction ensures binary compatibility with existing Neon-optimized workloads. + +You can confirm this setting by: +```bash +cat /proc/sys/abi/sve_default_vector_length +``` + +Output: +``` +16 +``` + +Even if you try to increase the length: + +```bash +echo 256 | sudo tee /proc/sys/abi/sve_default_vector_length +cat /proc/sys/abi/sve_default_vector_length +``` + +It will revert to 16. +This behavior is expected — SVE is enabled but fixed at 128 bits, so Neon remains the active execution path. + +{{% notice Note %}} +The current kernel image restricts the SVE vector length to 128 bits to maintain compatibility with existing software stacks. +Future kernel updates are expected to introduce configurable SVE vector lengths (for example, 256-bit or 512-bit). +This Learning Path will be revised accordingly once those capabilities become available on the Grace platform. +{{% /notice %}} + +In this session, you used ***Process Watch*** to observe instruction activity on the Grace CPU and interpret how Armv9 vector instructions are utilized during quantized LLM inference. +You confirmed that Neon SIMD remains the primary execution path under the current kernel configuration, while SVE and SVE2 are enabled but restricted to 128-bit vector length for compatibility reasons. 
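+
+If you want to relate these percentages to the total amount of work performed, you can optionally re-run a short generation under `perf stat` (assuming the Linux `perf` tool is installed on your image). Comparing the raw instruction and cycle counts with the Process Watch category breakdown gives a complementary, counter-based view of the same workload:
+
+```bash
+# Counter-based view of a short CPU-only run (limit generation to 64 tokens)
+sudo perf stat -e instructions,cycles,task-clock \
+  ~/llama.cpp/build-cpu/bin/llama-cli \
+    -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+    -ngl 0 -t 20 -n 64 \
+    -p "Explain the benefits of vector processing in modern Arm CPUs."
+```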
+ +This experiment highlights how architectural features evolve over time — the Grace CPU already implements advanced Armv9 capabilities, and future kernel releases will unlock their full potential. + +By mastering these observation tools and understanding the instruction mix, you are now better equipped to: +- Profile Arm-based systems at the architectural level, +- Interpret real-time performance data meaningfully, and +- Prepare your applications for future Armv9 enhancements. + diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md new file mode 100644 index 0000000000..67bc52c917 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -0,0 +1,51 @@ +--- +title: Deploying Quantized LLMs on DGX Spark using llama.cpp + +minutes_to_complete: 60 + +who_is_this_for: This session is intended for AI practitioners, performance engineers, and system architects who want to understand how the Grace–Blackwell (GB10) platform enables efficient quantized LLM inference through CPU–GPU collaboration. + +learning_objectives: + - Understand the Grace–Blackwell (GB10) architecture and how it supports efficient AI inference. + - Build and validate both CUDA 13-enabled and CPU-only versions of llama.cpp for flexible deployment of quantized LLMs on the GB10 platform. + - Observe and interpret how Armv9 SIMD instructions (Neon, SVE) are utilized during quantized LLM inference on the Grace CPU using Process Watch. + +prerequisites: + - One NVIDIA DGX Spark system with at least 15 GB of available disk space. + +author: Odin Shen + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Cortex-X + - Cortex-A +operatingsystems: + - Linux +tools_software_languages: + - Python + - C++ + - Bash + - llama.cpp + +further_reading: + - resource: + title: Nvidia DGX Spark + link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/ + type: website + - resource: + title: Nvidia DGX Spark Playbooks + link: https://github.com/NVIDIA/dgx-spark-playbooks + type: documentation + - resource: + title: Arm Blog Post + link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations + type: Blog + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png new file mode 100644 index 0000000000..0bcd461ce8 Binary files /dev/null and b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png differ diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png new file mode 100644 index 0000000000..dbdb78ef15 Binary files /dev/null and b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png differ