---
title: Verify Grace Blackwell system readiness for AI inference
weight: 2
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Introduction to Grace Blackwell architecture
In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads.
You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions.

The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer's desktop.
The NVIDIA GB10 Grace Blackwell Superchip fuses the CPU and GPU into a single, unified compute engine.

The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines:

- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks.
This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision.
DGX Spark is a compact yet powerful development platform for modern AI workloads.
DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed.
### Why Grace Blackwell for quantized LLMs?
Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
|**Feature**|**Impact on Quantized LLMs**|
|--------------|------------------------------|
| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |
In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference.
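
To make the unified-memory point concrete, here is a rough sizing sketch you can run in a shell. The parameter count and the bits-per-parameter figure are illustrative assumptions for Q4_K_M models, not measured values:

```bash
# Back-of-the-envelope check: does a quantized model fit in unified memory?
PARAMS_B=8         # parameters in billions (for example, LLaMA3-8B)
BITS_PER_PARAM=5   # rough average bits per parameter for Q4_K_M, including overhead
EST_GB=$(echo "$PARAMS_B * $BITS_PER_PARAM / 8" | bc -l)
FREE_GB=$(awk '/MemAvailable/ {printf "%.1f", $2 / 1048576}' /proc/meminfo)
echo "Estimated model footprint: ${EST_GB} GB"
echo "Available unified memory:  ${FREE_GB} GB"
```

An 8-billion-parameter model at Q4_K_M therefore needs only about 5 GB, leaving most of the 128 GB unified memory free for the KV cache, longer contexts, or larger models.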
Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
### Inspecting your GB10 environment
Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs.
#### Step 1: Check CPU information
Run the following command to print the CPU information:
```bash
lscpu
```
Expected output:
```output
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
...
```

The following table summarizes the key specifications of the Grace CPU and explains their impact for LLM inference:

|**Category**|**Specification**|**Description / Impact for LLM Inference**|
|--------------|------------------------------|---------------------------------|
| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. |
| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
| Cache Hierarchy | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
| Instruction Set Features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |

Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing a power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
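
You can confirm that these extensions are exposed to software by filtering the `lscpu` flags. A quick check looks like this:

```bash
# Show only the instruction-set extensions relevant to quantized inference
lscpu | grep -o -E 'sve2?|bf16|i8mm' | sort -u
```

If the output lists `sve`, `sve2`, `bf16`, and `i8mm`, the CPU exposes the vector and matrix features that optimized CPU inference kernels can take advantage of.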
You can also verify the operating system running on your DGX Spark by using the following command:

```bash
lsb_release -a
```
Expected output:

```output
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
```
As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution.
It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities, making it an ideal environment for building and deploying quantized LLM workloads.
#### Step 2: Verify Blackwell GPU and driver
After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads.
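
Run the `nvidia-smi` command:

```bash
nvidia-smi
```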
The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
The table below provides more explanation of the `nvidia-smi` output:
|**Category**|**Specification (from nvidia-smi)**|**Description / Impact for LLM Inference**|
|--------------|------------------------------|---------------------------------|
| GPU Name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace Blackwell Superchip. |
| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
| GPU Utilization | 0% (Idle) | Indicates no active compute workloads; the GPU is ready for new inference jobs. |
| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
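
For a scriptable readiness check, you can also ask `nvidia-smi` for specific fields, for example:

```bash
# One-line, machine-readable summary of GPU readiness
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu --format=csv,noheader
```

This form is convenient in setup scripts that verify the GPU is healthy before launching longer workloads.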

#### Step 3: Check the CUDA toolkit
To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
This ensures that CMake can correctly detect and compile the GPU-accelerated components.
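
Run the following command:

```bash
nvcc --version
```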

{{% notice Note %}}
The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
{{% /notice %}}
This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
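
As an optional sanity check, you can compile and run a minimal CUDA program. The file path and kernel below are illustrative, and `sm_121` is the GB10 compute capability reported by `nvidia-smi`:

```bash
# Write a trivial CUDA program (illustrative path)
cat > /tmp/hello.cu <<'EOF'
#include <cstdio>

// Minimal kernel that prints a message from the GPU
__global__ void hello() { printf("Hello from the Blackwell GPU\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
    return 0;
}
EOF

# Compile for the GB10 (sm_121) and run
nvcc -arch=sm_121 -o /tmp/hello /tmp/hello.cu && /tmp/hello
```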

At this point, you have verified that:

- The Grace CPU (20 Armv9 cores) is recognized and the system is running Ubuntu 24.04 LTS.
- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.

Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.