---
title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Grace–Blackwell Architecture

In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform, a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads.
You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for the GPU-accelerated LLM sessions that follow.

The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer's desk.
At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, an architecture that fuses CPU and GPU into a single, unified compute engine.

The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines:
- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering strong single-thread performance and power efficiency.
- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
- A 128 GB unified memory subsystem connected over NVLink-C2C, enabling the CPU and GPU to share the same address space and eliminating data-transfer bottlenecks.

This design delivers up to one petaFLOP (1,000 TFLOPS) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads.

DGX Spark represents a major step toward NVIDIA's vision of AI Everywhere: it empowers developers to prototype, fine-tune, and deploy large-scale AI models locally, while connecting seamlessly to cloud or data center environments when needed.

More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai).

### Why Grace–Blackwell for Quantized LLMs?

Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip.

| **Feature** | **Impact on Quantized LLMs** |
|--------------|------------------------------|
| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling tightly synchronized CPU–GPU workloads. |
| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfers. |
| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |

In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference.

Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
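
To make this concrete, the sketch below shows how such a run might look with llama.cpp, which you will build in a later session. It is a minimal illustration rather than a prescribed command: the model path is a placeholder, and the `llama-cli` flags shown (`-m`, `-ngl`, `-t`, `-p`) follow current llama.cpp conventions but may differ across versions.

```bash
# Hypothetical example: run a Q4_K_M-quantized model with a CUDA-enabled llama.cpp build.
# -m   : path to the quantized GGUF weights (placeholder path below)
# -ngl : number of transformer layers to offload to the Blackwell GPU
# -t   : CPU threads for the Grace cores handling tokenization and orchestration
./llama-cli \
  -m ./models/qwen2-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -t 10 \
  -p "Explain the Grace-Blackwell architecture in one paragraph."
```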

### Inspecting Your GB10 Environment

Let's confirm that your environment is ready for the sessions ahead.

#### Step 1: Check CPU information

Run the following command to confirm CPU readiness:

```bash
lscpu
```

Expected output:
```log
Architecture: aarch64
  CPU op-mode(s): 64-bit
  Byte Order: Little Endian
CPU(s): 20
  On-line CPU(s) list: 0-19
Vendor ID: ARM
  Model name: Cortex-X925
  Model: 1
  Thread(s) per core: 1
  Core(s) per socket: 10
  Socket(s): 1
  Stepping: r0p1
  CPU(s) scaling MHz: 89%
  CPU max MHz: 4004.0000
  CPU min MHz: 1378.0000
  BogoMIPS: 2000.00
  Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
  Model name: Cortex-A725
  Model: 1
  Thread(s) per core: 1
  Core(s) per socket: 10
  Socket(s): 1
  Stepping: r0p1
  CPU(s) scaling MHz: 99%
  CPU max MHz: 2860.0000
  CPU min MHz: 338.0000
  BogoMIPS: 2000.00
  Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Caches (sum of all):
  L1d: 1.3 MiB (20 instances)
  L1i: 1.3 MiB (20 instances)
  L2: 25 MiB (20 instances)
  L3: 24 MiB (2 instances)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-19
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Reg file data sampling: Not affected
  Retbleed: Not affected
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Not affected
  Srbds: Not affected
  Tsx async abort: Not affected
```

The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it well suited to quantized LLM inference and tensor operations.

The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference.

| **Category** | **Specification** | **Description / Impact for LLM Inference** |
|---------------|-------------------|---------------------------------------------|
| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
| **Core Configuration** | 20 cores total: 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. |
| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread performance for token orchestration. |
| **Cache Hierarchy** | L1d: 1.3 MiB + L1i: 1.3 MiB (totals across 20 cores)<br>L2: 25 MiB total (20 instances)<br>L3: 24 MiB total (2 instances) | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access patterns for unified-memory workloads. |
| **Security & Reliability** | Not affected by Meltdown, Spectre v2, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |

Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make the Grace CPU well suited to **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
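
As an optional sanity check, you can confirm that these extensions are actually exposed by the kernel. The command below is a simple sketch using standard Linux tools; each flag should be reported once per core.

```bash
# Count how many cores report the SVE2, INT8 matrix (i8mm), and BF16 extensions.
# Expect a count of 20 for each on the GB10.
grep -o -w -E 'sve2|i8mm|bf16' /proc/cpuinfo | sort | uniq -c
```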

You can also verify the operating system running on your DGX Spark with the following command:

```bash
lsb_release -a
```

Expected output:
```log
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble
```

As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution.
It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities, making it an ideal environment for building and deploying quantized LLM workloads.
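
Optionally, you can also confirm the kernel architecture and the amount of unified memory visible to the OS. This is a quick sanity check with standard utilities; the reported total will be somewhat below 128 GB because of firmware and system reservations.

```bash
# Kernel release and machine architecture (expect an aarch64 kernel)
uname -mr

# Total memory visible to Linux: the unified pool shared by the CPU and GPU
free -h
```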

#### Step 2: Verify Blackwell GPU and Driver

After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads.

```bash
nvidia-smi
```

Expected output:
```log
Wed Oct 22 09:26:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   32C    P8              4W /  N/A  |     Not Supported      |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                         GPU Memory |
|        ID   ID                                                                Usage      |
|=========================================================================================|
|    0   N/A  N/A            3094      G   /usr/lib/xorg/Xorg                        43MiB |
|    0   N/A  N/A            3172      G   /usr/bin/gnome-shell                      16MiB |
+-----------------------------------------------------------------------------------------+
```

The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization, which helps verify that the system is stable and ready for AI workloads.

The table below explains the key fields reported by `nvidia-smi` and their relevance to LLM inference.

| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
|---------------|--------------------------------------|---------------------------------------------|
| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
| **Driver Version** | 580.95.05 | Indicates a driver package that supports the CUDA 13 runtime. |
| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
| **Memory** | Unified 128 GB LPDDR5X (shared with the CPU via NVLink-C2C) | Enables zero-copy data access between the Grace CPU and the GPU in a unified inference memory space. |
| **Power & Thermal Status** | ~4 W and 32°C at idle | Confirms the GPU is powered on and thermally stable while idle. |
| **GPU Utilization** | 0% (idle) | Indicates no active compute workloads; the GPU is ready for new inference jobs. |
| **Memory Usage** | Not Supported | Per-GPU memory totals are not reported because GB10 uses unified system memory managed by the OS rather than dedicated VRAM. |
| **Persistence Mode** | On | Keeps the GPU initialized and ready for rapid inference startup. |
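
For scripted environment checks, `nvidia-smi` can also report individual fields in machine-readable form. The query below is a minimal sketch; on GB10, some fields (for example, memory totals) may be reported as `[N/A]` because the unified memory pool is managed by the OS rather than the GPU driver.

```bash
# Query selected GPU properties as CSV, convenient for setup or CI scripts.
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,power.draw \
           --format=csv,noheader
```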

#### Step 3: Check CUDA Toolkit

To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed.
Running `nvcc --version` confirms that the CUDA compiler is available and compatible with CUDA 13.
This ensures that CMake can correctly detect and compile the GPU-accelerated components.

```bash
nvcc --version
```

Expected output:
```log
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
```

{{% notice Note %}}
In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
{{% /notice %}}

This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
If the command is missing or reports an older version (for example, 12.x), update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
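
As an optional smoke test, you can compile and run a trivial kernel to confirm that the toolkit can actually target the Blackwell GPU. The sketch below assumes the GB10 target of `sm_121` noted earlier; adjust the `-arch` value if your toolkit reports a different compute capability.

```bash
# Hypothetical smoke test: compile and run a minimal kernel for the GB10 target.
cat > /tmp/gb10_check.cu <<'EOF'
#include <cstdio>

__global__ void hello() { printf("Hello from the Blackwell GPU\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();   // wait for the kernel so its printf is flushed
    return 0;
}
EOF
nvcc -arch=sm_121 -o /tmp/gb10_check /tmp/gb10_check.cu && /tmp/gb10_check
```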

At this point, you have verified that:
- The Grace CPU (Arm Cortex-X925 / Cortex-A725) is correctly recognized and supports the expected Armv9 extensions.
- The Blackwell GPU is active with driver 580.95.05 and the CUDA 13 runtime.
- The CUDA 13.0 toolkit is available for building the GPU-enabled version of llama.cpp.

Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform.