diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md new file mode 100644 index 0000000000..fc44403c42 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -0,0 +1,237 @@ +--- +title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction to Grace–Blackwell Architecture + +In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform — a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads. +You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions. + +The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer’s desk. +At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, a breakthrough architecture that fuses CPU and GPU into a single, unified compute engine. + +The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines: +- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency. + +- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads. +- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks. + +This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads. + +DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere — empowering developers to prototype, fine-tune, and deploy large-scale AI models locally, while seamlessly connecting to the cloud or data center environments when needed. + +More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai). + + +### Why Grace–Blackwell for Quantized LLMs? + +Quantized Large Language Models (LLMs) — such as those using Q4, Q5, or Q8 precision — benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip. + +| **Feature** | **Impact on Quantized LLMs** | +|--------------|------------------------------| +| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). | +| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. | +| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. | +| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. 
| +| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. | + + +In a typical quantized LLM workflow: +- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks. +- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput. +- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference. + +Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor. + + +### Inspecting Your GB10 Environment + +Let’s confirm that your environment is ready for the sessions ahead. + +#### Step 1: Check CPU information + +Run the following commands to confirm CPU readiness: + +```bash +lscpu +``` + +Expected output: +```log +Architecture: aarch64 + CPU op-mode(s): 64-bit + Byte Order: Little Endian +CPU(s): 20 + On-line CPU(s) list: 0-19 +Vendor ID: ARM + Model name: Cortex-X925 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 89% + CPU max MHz: 4004.0000 + CPU min MHz: 1378.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt + Model name: Cortex-A725 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 99% + CPU max MHz: 2860.0000 + CPU min MHz: 338.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt +Caches (sum of all): + L1d: 1.3 MiB (20 instances) + L1i: 1.3 MiB (20 instances) + L2: 25 MiB (20 instances) + L3: 24 MiB (2 instances) +NUMA: + NUMA node(s): 1 + NUMA node0 CPU(s): 0-19 +Vulnerabilities: + Gather data sampling: Not affected + Itlb multihit: Not affected + L1tf: Not affected + Mds: Not affected + Meltdown: Not affected + Mmio stale data: Not affected + Reg file data sampling: Not affected + Retbleed: Not affected + Spec rstack overflow: Not affected + Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl + Spectre v1: Mitigation; __user pointer sanitization + Spectre v2: Not affected + Srbds: Not affected + Tsx async abort: Not affected +``` + +The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. + +The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. + +| **Category** | **Specification** | **Description / Impact for LLM Inference** | +|---------------|-------------------|---------------------------------------------| +| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. 
| +| **Core Configuration** | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. | +| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. | +| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. | +| **Cache Hierarchy** | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. | +| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. | +| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. | +| **Security & Reliability** | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. | + +Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make it ideal for **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. + +You can also verify the operating system running on your DGX Spark by using the following command: + +```bash +lsb_release -a +``` + +Expected output: +```log +No LSB modules are available. +Distributor ID: Ubuntu +Description: Ubuntu 24.04.3 LTS +Release: 24.04 +Codename: noble +``` +As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution. +It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. + + +#### Step 2: Verify Blackwell GPU and Driver + +After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads. + +```bash +nvidia-smi +``` + +Expected output: +```log +Wed Oct 22 09:26:54 2025 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A | +| N/A 32C P8 4W / N/A | Not Supported | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB | +| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB | ++-----------------------------------------------------------------------------------------+ +``` + +The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information — including driver status, temperature, power usage, and GPU utilization — which helps verify that the system is stable and ready for AI workloads. + +Understanding the Output of nvidia-smi +| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** | +|---------------|--------------------------------------|---------------------------------------------| +| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. 
| +| **Driver Version** | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. | +| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. | +| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. | +| **Memory** | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. | +| **Power & Thermal Status** | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. | +| **GPU-Utilization** | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. | +| **Memory Usage** | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | +| **Persistence Mode** | On | Ensures the GPU remains initialized and ready for rapid inference startup. | + + +#### Step 3: Check CUDA Toolkit + +To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed. +The command ***nvcc --version*** confirms that the CUDA compiler is available and compatible with CUDA 13. +This ensures that CMake can correctly detect and compile the GPU-accelerated components. + +```bash +nvcc --version +``` + +Expected output: +```log +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Aug_20_01:57:39_PM_PDT_2025 +Cuda compilation tools, release 13.0, V13.0.88 +Build cuda_13.0.r13.0/compiler.36424714_0 +``` + +{{% notice Note %}} +In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference. +{{% /notice %}} + +This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation. +If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121). + +At this point, you have verified that: +- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions. +- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime. +- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. + +Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md new file mode 100644 index 0000000000..fa6212fd6d --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -0,0 +1,186 @@ +--- +title: Building the GPU Version of llama.cpp on GB10 +weight: 3 +layout: "learningpathall" +--- + +## Building GPU Version of llama.cpp on GB10 + +In the previous session, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. 
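+
+If you are returning to the system in a new shell session, you can quickly re-confirm that the GPU driver and the CUDA toolchain are visible before building. This is a minimal sanity check, and the exact versions reported on your system may differ from the examples in this Learning Path:
+
+```bash
+# Report only the GPU name and driver version
+nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
+
+# Confirm which CUDA compiler release is on the PATH
+nvcc --version | grep release
+```
+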
+Now that your hardware and drivers are ready, this session focuses on building the GPU-enabled version of **llama.cpp** — a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
+
+[llama.cpp](https://github.com/ggml-org/llama.cpp) is an open-source project by Georgi Gerganov that provides efficient, dependency-free large language model inference on both CPUs and GPUs. For an introduction to llama.cpp on Arm, you can also refer to this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/2_llama.cpp_intro/).
+
+### Step 1: Preparation
+
+In this step, you will install the necessary build tools and download a small quantized model for validation.
+
+```bash
+sudo apt update
+sudo apt install -y git cmake build-essential nvtop htop
+```
+
+These packages provide the C/C++ compiler toolchain, the CMake build system, and the monitoring utilities (`nvtop`, `htop`) required to compile and test llama.cpp.
+
+To verify your GPU build later, you need at least one quantized model for testing.
+First, create a Python virtual environment, install the latest Hugging Face Hub CLI, and download the model:
+
+```bash
+mkdir -p ~/models
+cd ~/models
+
+python3 -m venv venv
+source venv/bin/activate
+pip install -U huggingface_hub
+
+hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B
+```
+
+After the download completes, the model files will be available in the `~/models/TinyLlama-1.1B` directory.
+
+### Step 2: Clone the llama.cpp Repository
+
+In this step, you will download the source code for llama.cpp from GitHub.
+
+```bash
+cd ~
+git clone https://github.com/ggerganov/llama.cpp.git
+cd ~/llama.cpp
+```
+
+### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
+
+Run the following `cmake` command to configure the build system for GPU acceleration.
+This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.
+
+```bash
+mkdir -p build-gpu
+cd build-gpu
+
+cmake .. \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DGGML_CUDA=ON \
+  -DGGML_CUDA_F16=ON \
+  -DCMAKE_CUDA_ARCHITECTURES=121 \
+  -DCMAKE_C_COMPILER=gcc \
+  -DCMAKE_CXX_COMPILER=g++ \
+  -DCMAKE_CUDA_COMPILER=nvcc
+```
+
+Explanation of key flags:
+
+| **Feature** | **Description / Impact** |
+|--------------|------------------------------|
+| **-DGGML_CUDA=ON** | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration. |
+| **-DGGML_CUDA_F16=ON** | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (e.g., Q4, Q5). |
+| **-DCMAKE_CUDA_ARCHITECTURES=121** | Specifies the compute capability of the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels. |
+
+When the configuration process completes successfully, the terminal should display output similar to the following:
+
+```
+-- Configuring done (2.0s)
+-- Generating done (0.1s)
+-- Build files have been written to: /home/nvidia/llama.cpp/build-gpu
+```
+
+{{% notice Note %}}
+1. For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct **CUDA 13.0 toolchain**.
+2. If configuration errors occur, revisit **Session 1** to verify that your CUDA toolkit and driver versions are properly installed and aligned with **Blackwell (sm_121)** support.
+{{% /notice %}}
+
+Once CMake configuration succeeds, start the compilation process:
+
+```bash
+make -j"$(nproc)"
+```
+
+This command compiles all CUDA and C++ source files in parallel using all available CPU cores.
+Thanks to the high-performance Grace CPU and unified memory subsystem, the build on the DGX Spark (GB10) typically completes within 2–4 minutes.
+
+Example build output:
+```
+[  0%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
+[ 15%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o
+[ 50%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
+[100%] Built target test-backend-ops
+[100%] Linking CXX executable ../../bin/llama-server
+[100%] Built target llama-server
+```
+
+After the build completes, the GPU-accelerated binaries will be located under `~/llama.cpp/build-gpu/bin/`.
+
+These binaries provide all the tools needed for quantized model inference (llama-cli) and for serving GPU inference via an HTTP API (llama-server). Together, the configuration options above ensure that the build targets the Grace–Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
+
+### Step 4: Validate the CUDA-Enabled Build (GPU Mode)
+
+After the build completes successfully, verify that the GPU-enabled binary of ***llama.cpp*** is correctly linked to the NVIDIA CUDA runtime.
+
+To verify CUDA linkage, run the following command:
+
+```bash
+ldd bin/llama-cli | grep cuda
+```
+
+Expected output:
+```
+  libggml-cuda.so => /home/nvidia/llama.cpp/build-gpu/bin/libggml-cuda.so (0x0000eee1e8e30000)
+  libcudart.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13 (0x0000eee1e83b0000)
+  libcublas.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13 (0x0000eee1e4860000)
+  libcuda.so.1 => /lib/aarch64-linux-gnu/libcuda.so.1 (0x0000eee1debd0000)
+  libcublasLt.so.13 => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13 (0x0000eee1b36c0000)
+```
+
+If the CUDA libraries are correctly linked, the binary can access the GPU through the system driver.
+
+Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability:
+
+```bash
+./bin/llama-server --version
+```
+
+Expected output:
+```
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
+version: 6819 (19a5a3ed)
+built with gcc (Ubuntu 12.4.0-2ubuntu1~24.04) 12.4.0 for aarch64-linux-gnu
+```
+
+The message "compute capability 12.1" confirms that the build was compiled specifically for the Blackwell GPU (sm_121) and that CUDA 13.0 is functioning correctly.
+
+Next, use the pre-downloaded quantized model (for example, TinyLlama-1.1B) to verify that inference executes successfully on the GPU:
+
+```bash
+./bin/llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 32 \
+  -t 16 \
+  -p "Explain the advantages of the Armv9 architecture."
+```
+
+If the build is successful, you will see text generation begin within a few seconds.
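+
+Optionally, you can get a quick throughput estimate with `llama-bench`, which is built alongside `llama-cli` in the same `bin/` directory. The invocation below is a minimal sketch; the reported prompt-processing (pp) and token-generation (tg) rates will vary with the model and settings you choose:
+
+```bash
+# Offload all layers to the GPU and benchmark prompt processing and generation
+./bin/llama-bench \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 99 \
+  -p 512 \
+  -n 128
+```
+
+Substantially higher tokens-per-second here than with the CPU-only build you will create in the next session indicates that the transformer layers are executing on the Blackwell GPU.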
+ +While `nvidia-smi` can display basic GPU information, `nvtop` provides real-time visualization of utilization, temperature, and power metrics — useful for verifying CUDA kernel activity during inference. + +```bash +nvtop +``` + +The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark. + +![image1 nvtop screenshot](nvtop.png "TinyLlama GPU Utilization") + +The nvtop interface shows: +- GPU Utilization (%) : confirm CUDA kernels are active +- Memory Usage (VRAM) : observe model loading and runtime footprint +- Temperature / Power Draw : monitor thermal stability under sustained workloads + +You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark. +In the next session, you will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md new file mode 100644 index 0000000000..9d9d3fc93d --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -0,0 +1,139 @@ +--- +title: Building the CPU Version of llama.cpp on GB10 +weight: 4 +layout: "learningpathall" +--- + +## Building CPU Version of llama.cpp on GB10 + +### Step 1: Configure and Build the CPU-Only Version + +In this session, you will configure and build the CPU-only version of **llama.cpp**, optimized for the **Armv9**-based Grace CPU. + +This build runs entirely on the **Grace CPU (Arm Cortex-X925 and Cortex-A725)**, which supports advanced Armv9 vector extensions such as **SVE2**, **BF16**, and **I8MM**, making it highly efficient for quantized inference workloads even without GPU acceleration. + +Start from a clean directory to ensure a clean separation from the GPU build artifacts. +Run the following commands to configure the build system for the CPU-only version of llama.cpp. + +```bash +cd ~/llama.cpp +mkdir -p build-cpu +cd build-cpu + +cmake .. 
\
+  -DCMAKE_BUILD_TYPE=Release \
+  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
+  -DLLAMA_ACCELERATE=ON \
+  -DLLAMA_BLAS=OFF \
+  -DCMAKE_C_COMPILER=gcc \
+  -DCMAKE_CXX_COMPILER=g++ \
+  -DCMAKE_C_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp" \
+  -DCMAKE_CXX_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp"
+```
+
+Explanation of key flags:
+
+| **Feature** | **Description / Impact** |
+|--------------|------------------------------|
+| **-march=armv9-a** | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions. |
+| **+sve2+bf16+i8mm** | Activates Scalable Vector Extension 2 (SVE2), BFloat16, and INT8 matrix multiply (I8MM) instructions for quantized inference. |
+| **-fopenmp** | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized. |
+| **-mtune=native** | Optimizes code generation for the local Grace CPU microarchitecture. |
+| **-DLLAMA_ACCELERATE=ON** | Enables llama.cpp’s internal ARM acceleration path (Neon/SVE optimized kernels). |
+
+When the configuration process completes successfully, the terminal should display output similar to the following:
+
+```
+-- Configuring done (1.1s)
+-- Generating done (0.1s)
+-- Build files have been written to: /home/nvidia/llama.cpp/build-cpu
+```
+
+Then, start the compilation process:
+
+```bash
+make -j"$(nproc)"
+```
+
+{{% notice Note %}}
+If the build fails after modifying optimization flags, it is likely due to a stale CMake cache.
+Run the following commands to perform a clean reconfiguration:
+```bash
+cmake --fresh .
+make -j"$(nproc)"
+```
+{{% /notice %}}
+
+The CPU build on the DGX Spark (GB10) completes in about 20 seconds — even faster than the GPU build.
+
+Example build output:
+```
+[ 25%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
+[ 50%] Linking CXX executable ../bin/test-tokenizer-0
+[ 75%] Linking CXX executable ../bin/test-alloc
+[100%] Linking CXX executable ../../bin/llama-server
+[100%] Built target llama-server
+```
+
+After the build finishes, the CPU-optimized binaries will be available under `~/llama.cpp/build-cpu/bin/`.
+
+### Step 2: Validate the CPU-Only Build (CPU Mode)
+
+In this step, you will validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU.
+
+```bash
+./bin/llama-server --version
+```
+
+Expected output:
+```
+version: 6819 (19a5a3ed)
+built with gcc (Ubuntu 12.4.0-2ubuntu1~24.04) 12.4.0 for aarch64-linux-gnu
+```
+
+Unlike the GPU build, the version banner contains no `ggml_cuda_init` lines, which confirms that this is a CPU-only binary optimized for the Grace CPU.
+
+Next, use the pre-downloaded quantized model (for example, TinyLlama-1.1B) to verify that inference executes successfully on the CPU:
+
+```bash
+./bin/llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 0 \
+  -t 20 \
+  -p "Explain the advantages of the Armv9 architecture."
+```
+
+- ***-ngl 0*** : Disables GPU offloading (CPU-only execution).
+- ***-t 20*** : Uses 20 threads (one per Grace CPU core).
+
+If the build is successful, you will observe smooth model initialization and token generation, with CPU utilization increasing across all cores.
+
+For live CPU utilization and thread activity, use `htop` instead of `nvtop`:
+
+```bash
+htop
+```
+
+The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
+![image2 htop screenshot](htop.png "TinyLlama CPU Utilization") + +The `htop` interface shows: + +- **CPU Utilization**: All 20 cores operate between 75–85%, confirming efficient multi-thread scaling. +- **Load Average**: Around 5.0, indicating balanced workload distribution. +- **Memory Usage**: Approximately 4.5 GB total for the TinyLlama Q8_0 model. +- **Process List**: Displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism + +{{% notice Note %}} +In htop, press **`F6`** to sort by CPU% and verify load distribution, or press **`t`** to toggle the **tree view**, which clearly shows the `llama-cli` main process and its worker threads. +{{% /notice %}} + +In this session, you have: +- Built and validated the CPU-only version of llama.cpp. +- Optimized the Grace CPU build using Armv9 vector extensions (SVE2, BF16, I8MM). +- Tested quantized model inference using the TinyLlama Q8_0 model. +- Used monitoring tools (htop) to confirm efficient CPU utilization. + + +You have now successfully built and validated the CPU-only version of ***llama.cpp*** on the Grace CPU. +In the next session, you will learn how to use ***processwatch*** to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md new file mode 100644 index 0000000000..4406bcbfab --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md @@ -0,0 +1,212 @@ +--- +title: Analyzing Instruction Mix on Grace CPU using Process Watch +weight: 5 +layout: "learningpathall" +--- + +## Analyzing Instruction Mix on the Grace CPU Using Process Watch + +In this session, you will explore how the **Grace CPU** executes Armv9 vector and matrix instructions during quantized LLM inference. +By using **Process Watch**, you will observe how Neon SIMD instructions dominate execution on the Grace CPU and learn why SVE and SVE2 remain inactive under the current kernel configuration. +This exercise demonstrates how Armv9 vector execution behaves in real AI workloads and how hardware capabilities evolve—from traditional SIMD pipelines to scalable vector and matrix computation. + +### Step 1: Observe SIMD Execution with Process Watch + +Start by running a quantized model on the Grace CPU: + +In this step, you will install and configure **Process Watch**, an instruction-level profiling tool that shows live CPU instruction usage across threads. It supports real-time visualization of **NEON**, **SVE**, **FP**, and other vector and scalar instructions executed on Armv9 processors. + +```bash +sudo apt update +sudo apt install -y git cmake build-essential libncurses-dev libtinfo-dev +``` + +Use the following commands to download the source code, compile it, and install the binary into the processwatch directory. + +```bash +# Clone and build Process Watch +cd ~ +git clone --recursive https://github.com/intel/processwatch.git +cd processwatch +./build.sh +sudo ln -s ~/processwatch/processwatch /usr/local/bin/processwatch +``` + +To collect instruction-level metrics, ***Process Watch*** requires access to kernel performance counters and eBPF features. +Although it can run as a non-root user, full functionality requires elevated privileges. For simplicity and completeness, run it with administrative rights. 
+
+Run the following commands to enable the required permissions:
+```bash
+sudo setcap CAP_PERFMON,CAP_BPF=+ep ./processwatch
+sudo sysctl -w kernel.perf_event_paranoid=-1
+sudo sysctl kernel.unprivileged_bpf_disabled=0
+```
+
+These commands:
+- Grant Process Watch the ability to use performance monitoring (perf) and eBPF tracing.
+- Lower kernel restrictions on accessing performance counters.
+- Allow unprivileged users to attach performance monitors.
+
+Verify the installation:
+
+```bash
+./processwatch --help
+```
+
+You should see a usage summary similar to:
+```
+usage: processwatch [options]
+options:
+  -h          Displays this help message.
+  -v          Displays the version.
+  -i <seconds>  Prints results every <seconds> seconds.
+  -n <num>      Prints results for <num> intervals.
+  -c          Prints all results in CSV format to stdout.
+  -p <pid>      Only profiles <pid>.
+  -m          Displays instruction mnemonics, instead of categories.
+  -s <period>   Profiles instructions with a sampling period of <period>. Defaults to 100000 instructions (1 in 100000 instructions).
+  -f <filter>   Can be used multiple times. Defines filters for columns. Defaults to 'FPARMv8', 'NEON', 'SVE' and 'SVE2'.
+  -a          Displays a column for each category, mnemonic, or extension. This is a lot of output!
+  -l          Prints a list of all available categories, mnemonics, or extensions.
+  -d          Prints only debug information.
+```
+
+In this step, you will run a quantized TinyLlama model on the Grace CPU to generate live instruction activity.
+
+Use the same CPU-only llama.cpp build created in the previous session:
+
+```bash
+cd ~/llama.cpp/build-cpu/bin
+./llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 0 \
+  -t 20 \
+  -p "Explain the benefits of vector processing in modern Arm CPUs."
+```
+
+Keep this terminal running while the model generates text output.
+You will now attach Process Watch to this active process.
+
+Once the llama.cpp process is running on the Grace CPU, attach Process Watch to observe its live instruction activity.
+If only one ***llama-cli*** process is running, you can launch Process Watch without looking up its PID manually:
+
+```bash
+sudo processwatch --pid $(pgrep llama-cli)
+```
+
+Here, `pgrep` resolves the PID of the running llama-cli process, and Process Watch attaches to it directly.
+
+If multiple instances of llama-cli or other workloads are active, first list the matching process IDs:
+
+```bash
+pgrep llama-cli
+```
+
+Then attach Process Watch to the process you want to monitor:
+
+```bash
+sudo processwatch --pid <PID>
+```
+{{% notice Note %}}
+The Process Watch `-l` (list) option prints the available instruction categories, mnemonics, and extensions; it does not list running processes.
+Use `pgrep`, `ps -ef | grep llama`, or `htop` to identify process IDs before attaching.
+{{% /notice %}} + +The tool will display a live instruction breakdown similar to the following: +``` +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 5.07 15.23 0.00 0.00 100.00 29272 +72930 llama-cli 5.07 15.23 0.00 0.00 100.00 29272 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.57 9.95 0.00 0.00 100.00 69765 +72930 llama-cli 2.57 9.95 0.00 0.00 100.00 69765 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 1.90 6.61 0.00 0.00 100.00 44249 +72930 llama-cli 1.90 6.61 0.00 0.00 100.00 44249 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.60 10.16 0.00 0.00 100.00 71049 +72930 llama-cli 2.60 10.16 0.00 0.00 100.00 71049 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.12 7.56 0.00 0.00 100.00 68553 +72930 llama-cli 2.12 7.56 0.00 0.00 100.00 68553 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.52 9.40 0.00 0.00 100.00 65339 +72930 llama-cli 2.52 9.40 0.00 0.00 100.00 65339 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.34 7.76 0.00 0.00 100.00 42015 +72930 llama-cli 2.34 7.76 0.00 0.00 100.00 42015 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.66 9.77 0.00 0.00 100.00 74616 +72930 llama-cli 2.66 9.77 0.00 0.00 100.00 74616 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.15 7.06 0.00 0.00 100.00 58496 +72930 llama-cli 2.15 7.06 0.00 0.00 100.00 58496 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.61 9.34 0.00 0.00 100.00 73365 +72930 llama-cli 2.61 9.34 0.00 0.00 100.00 73365 + +PID NAME FPARMv8 NEON SVE SVE2 %TOTAL TOTAL +ALL ALL 2.52 8.37 0.00 0.00 100.00 26566 +72930 llama-cli 2.52 8.37 0.00 0.00 100.00 26566 +``` + +Interpretation: +- NEON (≈ 7–15 %) : Active SIMD integer and floating-point operations. +- FPARMv8 : Scalar FP operations (e.g., activation, normalization). +- SVE/SVE2 = 0 : The kernel is restricted to 128-bit vectors and does not issue SVE instructions. + +This confirms that the Grace CPU performs quantized inference primarily using Neon SIMD pipelines. + + +### Step 2: Why SVE and SVE2 Remain Inactive + +Although the Grace CPU supports SVE and SVE2, the current NVIDIA Grace kernel limits the default vector length to 16 bytes (128-bit). +This restriction ensures binary compatibility with existing Neon-optimized workloads. + +You can confirm this setting by: +```bash +cat /proc/sys/abi/sve_default_vector_length +``` + +Output: +``` +16 +``` + +Even if you try to increase the length: + +```bash +echo 256 | sudo tee /proc/sys/abi/sve_default_vector_length +cat /proc/sys/abi/sve_default_vector_length +``` + +It will revert to 16. +This behavior is expected — SVE is enabled but fixed at 128 bits, so Neon remains the active execution path. + +{{% notice Note %}} +The current kernel image restricts the SVE vector length to 128 bits to maintain compatibility with existing software stacks. +Future kernel updates are expected to introduce configurable SVE vector lengths (for example, 256-bit or 512-bit). +This Learning Path will be revised accordingly once those capabilities become available on the Grace platform. +{{% /notice %}} + +In this session, you used ***Process Watch*** to observe instruction activity on the Grace CPU and interpret how Armv9 vector instructions are utilized during quantized LLM inference. +You confirmed that Neon SIMD remains the primary execution path under the current kernel configuration, while SVE and SVE2 are enabled but restricted to 128-bit vector length for compatibility reasons. 
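+
+If you want to relate these percentages to the total amount of work performed, you can optionally re-run a short generation under `perf stat` (assuming the Linux `perf` tool is installed on your image). Comparing the raw instruction and cycle counts with the Process Watch category breakdown gives a complementary, counter-based view of the same workload:
+
+```bash
+# Counter-based view of a short CPU-only run (limit generation to 64 tokens)
+sudo perf stat -e instructions,cycles,task-clock \
+  ~/llama.cpp/build-cpu/bin/llama-cli \
+    -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+    -ngl 0 -t 20 -n 64 \
+    -p "Explain the benefits of vector processing in modern Arm CPUs."
+```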
+ +This experiment highlights how architectural features evolve over time — the Grace CPU already implements advanced Armv9 capabilities, and future kernel releases will unlock their full potential. + +By mastering these observation tools and understanding the instruction mix, you are now better equipped to: +- Profile Arm-based systems at the architectural level, +- Interpret real-time performance data meaningfully, and +- Prepare your applications for future Armv9 enhancements. + diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md new file mode 100644 index 0000000000..67bc52c917 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -0,0 +1,51 @@ +--- +title: Deploying Quantized LLMs on DGX Spark using llama.cpp + +minutes_to_complete: 60 + +who_is_this_for: This session is intended for AI practitioners, performance engineers, and system architects who want to understand how the Grace–Blackwell (GB10) platform enables efficient quantized LLM inference through CPU–GPU collaboration. + +learning_objectives: + - Understand the Grace–Blackwell (GB10) architecture and how it supports efficient AI inference. + - Build and validate both CUDA 13-enabled and CPU-only versions of llama.cpp for flexible deployment of quantized LLMs on the GB10 platform. + - Observe and interpret how Armv9 SIMD instructions (Neon, SVE) are utilized during quantized LLM inference on the Grace CPU using Process Watch. + +prerequisites: + - One NVIDIA DGX Spark system with at least 15 GB of available disk space. + +author: Odin Shen + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Cortex-X + - Cortex-A +operatingsystems: + - Linux +tools_software_languages: + - Python + - C++ + - Bash + - llama.cpp + +further_reading: + - resource: + title: Nvidia DGX Spark + link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/ + type: website + - resource: + title: Nvidia DGX Spark Playbooks + link: https://github.com/NVIDIA/dgx-spark-playbooks + type: documentation + - resource: + title: Arm Blog Post + link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations + type: Blog + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png new file mode 100644 index 0000000000..0bcd461ce8 Binary files /dev/null and b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/htop.png differ diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png new file mode 100644 index 0000000000..dbdb78ef15 Binary files /dev/null and b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/nvtop.png differ