
Commit de650a2

Merge pull request #2489 from jasonrandrews/review
review llama.cpp on GB10
2 parents 3ec77f9 + 137a06f commit de650a2

File tree

5 files changed: +193 -186 lines

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md

Lines changed: 61 additions & 57 deletions
````diff
@@ -1,67 +1,67 @@
 ---
-title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference
+title: Verify Grace Blackwell system readiness for AI inference
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Introduction to Grace–Blackwell Architecture
+## Introduction to Grace Blackwell architecture
+
+In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads.

-In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform — a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads.
 You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions.

-The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer’s desk.
-At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, a breakthrough architecture that fuses CPU and GPU into a single, unified compute engine.
+The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop.
+The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.

-The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines:
-- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
+The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines:
+- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.

-- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
+- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
 - A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks.

-This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads.
-
-DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere — empowering developers to prototype, fine-tune, and deploy large-scale AI models locally, while seamlessly connecting to the cloud or data center environments when needed.
+This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision.
+DGX Spark is a compact yet powerful development platform for modern AI workloads.

-More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai).
+DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed.

+### Why Grace Blackwell for quantized LLMs?

-### Why Grace–Blackwell for Quantized LLMs?
-
-Quantized Large Language Models (LLMs) — such as those using Q4, Q5, or Q8 precision — benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.

 | **Feature** | **Impact on Quantized LLMs** |
 |--------------|------------------------------|
-| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
-| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
-| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
-| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
-| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |
+| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
+| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
+| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
+| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
+| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |


 In a typical quantized LLM workflow:
 - The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
 - The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
 - Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference.

-Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
+Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.


-### Inspecting Your GB10 Environment
+### Inspecting your GB10 environment

-Let’s confirm that your environment is ready for the sessions ahead.
+Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs.

 #### Step 1: Check CPU information

-Run the following commands to confirm CPU readiness:
+Run the following command to print the CPU information:

 ```bash
 lscpu
 ```

 Expected output:
-```log
+
+```output
 Architecture: aarch64
 CPU op-mode(s): 64-bit
 Byte Order: Little Endian
````
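The full `lscpu` listing is long. If you only want to confirm the architecture and core count before moving on, you can filter it; a minimal sketch using the standard Ubuntu tools (on GB10 the expected values are `aarch64` and 20 cores):

```bash
# Confirm the architecture and core count reported by lscpu (expect aarch64 and 20).
lscpu | grep -E '^(Architecture|CPU\(s\))'
nproc   # total usable cores
```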
````diff
@@ -125,16 +125,16 @@ The following table summarizes the key specifications of the Grace CPU and expla

 | **Category** | **Specification** | **Description / Impact for LLM Inference** |
 |---------------|-------------------|---------------------------------------------|
-| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
-| **Core Configuration** | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
-| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. |
-| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
-| **Cache Hierarchy** | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
-| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
-| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
-| **Security & Reliability** | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |
+| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
+| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
+| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. |
+| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
+| Cache Hierarchy | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
+| Instruction Set Features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
+| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
+| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |

-Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make it ideal for **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
+Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing a power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.

 You can also verify the operating system running on your DGX Spark by using the following command:

````
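To confirm the SVE2, BF16, and I8MM capabilities called out above, you can check the CPU feature flags directly; a minimal sketch (Linux reports these as the lowercase flags `sve2`, `bf16`, and `i8mm`):

```bash
# Print only the quantization-relevant feature flags advertised by the kernel.
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E -x 'sve|sve2|bf16|i8mm' | sort -u
```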

````diff
@@ -143,27 +143,28 @@ lsb_release -a
 ```

 Expected output:
+
 ```log
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 24.04.3 LTS
 Release: 24.04
 Codename: noble
 ```
-As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution.
+As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution.
 It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads.

+#### Step 2: Verify Blackwell GPU and driver

-#### Step 2: Verify Blackwell GPU and Driver
-
-After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads.
+After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads.

 ```bash
 nvidia-smi
 ```

-Expected output:
-```log
+You will see output similar to:
+
+```output
 Wed Oct 22 09:26:54 2025
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
````
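For a script-friendly alternative to the full `nvidia-smi` table, the query interface prints just the fields you care about; a minimal sketch (field support varies by platform, and some values on GB10 are reported as Not Supported):

```bash
# Print GPU name, driver version, idle utilization, and temperature as CSV.
nvidia-smi --query-gpu=name,driver_version,utilization.gpu,temperature.gpu \
           --format=csv,noheader
```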
````diff
@@ -187,34 +188,37 @@ Wed Oct 22 09:26:54 2025
 +-----------------------------------------------------------------------------------------+
 ```

-The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information — including driver status, temperature, power usage, and GPU utilization — which helps verify that the system is stable and ready for AI workloads.
+The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
+
+The table below provides more explanation of the `nvidia-smi` output:

-Understanding the Output of nvidia-smi
 | **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
 |---------------|--------------------------------------|---------------------------------------------|
-| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
-| **Driver Version** | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
-| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
-| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
-| **Memory** | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
-| **Power & Thermal Status** | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
-| **GPU-Utilization** | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. |
-| **Memory Usage** | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
-| **Persistence Mode** | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
+| GPU Name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
+| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
+| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
+| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
+| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
+| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
+| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. |
+| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
+| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. |


 #### Step 3: Check CUDA Toolkit

-To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed.
-The command ***nvcc --version*** confirms that the CUDA compiler is available and compatible with CUDA 13.
+To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
+
+The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
 This ensures that CMake can correctly detect and compile the GPU-accelerated components.

 ```bash
 nvcc --version
 ```

-Expected output:
-```log
+You will see output similar to:
+
+```output
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2025 NVIDIA Corporation
 Built on Wed_Aug_20_01:57:39_PM_PDT_2025
````
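Beyond printing the compiler version, you can confirm the toolchain and GPU work end to end by compiling and running a trivial device query; a minimal sketch, assuming `nvcc` is on your PATH and using `sm_121` to match the compute capability reported above (adjust if your toolkit reports a different target):

```bash
# Build and run a tiny CUDA program that queries the first device.
cat > check_gpu.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device visible\n");
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    std::printf("%s: compute capability %d.%d, %zu MiB memory\n",
                prop.name, prop.major, prop.minor, prop.totalGlobalMem >> 20);
    return 0;
}
EOF
nvcc -arch=sm_121 check_gpu.cu -o check_gpu && ./check_gpu
```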
````diff
@@ -223,7 +227,7 @@ Build cuda_13.0.r13.0/compiler.36424714_0
 ```

 {{% notice Note %}}
-In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
+The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
 {{% /notice %}}

 This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
````
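Since inference binaries need only the CUDA runtime and driver libraries, not nvcc, you can separately confirm that those libraries are visible to the dynamic loader; a minimal sketch:

```bash
# libcuda comes from the driver package; libcudart from the CUDA toolkit.
ldconfig -p | grep -E 'libcuda\.so|libcudart\.so'
```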
````diff
@@ -234,4 +238,4 @@ At this point, you have verified that:
 - The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
 - The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.

-Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform.
+Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.
````
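For orientation before that build section, the CUDA-enabled llama.cpp configure step generally follows the pattern below; a representative sketch, assuming the current upstream repository location and the `GGML_CUDA` CMake option (verify both against the build instructions in the next section):

```bash
# Fetch llama.cpp and configure a CUDA build (option names can change between releases).
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build -DGGML_CUDA=ON
cmake --build build -j"$(nproc)"
```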
