---
title: Understanding the Grace–Blackwell Architecture for Efficient AI Inference
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Grace–Blackwell Architecture

In this session, you will explore the architecture and system design of the **NVIDIA Grace–Blackwell ([DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/))** platform — a next-generation Arm-based CPU–GPU hybrid designed for large-scale AI workloads.
You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for the GPU-accelerated LLM sessions that follow.

The NVIDIA DGX Spark is a personal AI supercomputer designed to bring data center–class AI computing directly to the developer's desk.
At the heart of DGX Spark lies the NVIDIA GB10 Grace–Blackwell Superchip, a breakthrough architecture that fuses CPU and GPU into a single, unified compute engine.

The **NVIDIA Grace–Blackwell DGX Spark (GB10)** platform combines:

- The NVIDIA **Grace CPU**, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
- The NVIDIA **Blackwell GPU**, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space over NVLink-C2C, eliminating data-transfer bottlenecks.

This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision, making DGX Spark a compact yet powerful development platform for modern AI workloads.

DGX Spark represents a major step toward NVIDIA's vision of AI Everywhere — empowering developers to prototype, fine-tune, and deploy large-scale AI models locally, while seamlessly connecting to cloud or data center environments when needed.

More information about the NVIDIA DGX Spark can be found in this [blog](https://newsroom.arm.com/blog/arm-nvidia-dgx-spark-high-performance-ai).

### Why Grace–Blackwell for Quantized LLMs?

Quantized Large Language Models (LLMs) — such as those using Q4, Q5, or Q8 precision — benefit enormously from the hybrid architecture of the Grace–Blackwell Superchip.

| **Feature** | **Impact on Quantized LLMs** |
|--------------|------------------------------|
| **Grace CPU (Arm Cortex-X925 / A725)** | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
| **Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores)** | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
| **High Bandwidth + Low Latency** | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
| **Unified 128 GB Memory (NVLink-C2C)** | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfers. |
| **Energy-Efficient Arm Design** | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |

In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference.

Together, these features make the GB10 not just a compute platform, but a developer-grade AI laboratory capable of running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
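
Because both processors draw from the same 128 GB pool, you can sanity-check the memory visible to the CPU side before loading any models. A minimal check (the exact total varies slightly with firmware and driver reservations):

```bash
# Show the unified memory pool as seen from the Grace CPU;
# the total should be close to the 128 GB shared capacity.
free -h
```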

### Inspecting Your GB10 Environment

Let’s confirm that your environment is ready for the sessions ahead.

#### Step 1: Check CPU information

Run the following command to confirm CPU readiness:

```bash
lscpu
```

Expected output:
```log
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                ARM
  Model name:             Cortex-X925
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   89%
    CPU max MHz:          4004.0000
    CPU min MHz:          1378.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
  Model name:             Cortex-A725
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   99%
    CPU max MHz:          2860.0000
    CPU min MHz:          338.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Caches (sum of all):
  L1d:                    1.3 MiB (20 instances)
  L1i:                    1.3 MiB (20 instances)
  L2:                     25 MiB (20 instances)
  L3:                     24 MiB (2 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```

The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it well suited to quantized LLM inference and tensor operations.

The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference.

| **Category** | **Specification** | **Description / Impact for LLM Inference** |
|---------------|-------------------|---------------------------------------------|
| **Architecture** | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
| **Core Configuration** | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
| **Threads per Core** | 1 | Optimized for deterministic scheduling and predictable latency. |
| **Clock Frequency** | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
| **Cache Hierarchy** | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
| **Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
| **NUMA Topology** | Single NUMA node (node0: 0–19) | Simplifies memory access patterns for unified memory workloads. |
| **Security & Reliability** | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |

Its **SVE2**, **BF16**, and **INT8 matrix (I8MM)** capabilities make it ideal for **quantized LLM workloads**, providing a stable, power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
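
To confirm at a glance that these extensions are exposed to user space, you can filter the `lscpu` flags directly. A quick check (the flag names match the output shown above):

```bash
# Extract the vector/matrix extensions relevant to quantized math;
# expect bf16, i8mm, and sve2 on the Grace CPU.
lscpu | grep -o -w -E 'sve2|i8mm|bf16' | sort -u
```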

You can also verify the operating system running on your DGX Spark by using the following command:

```bash
lsb_release -a
```

Expected output:
```log
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble
```

As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a modern and developer-friendly Linux distribution.
It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities — making it an ideal environment for building and deploying quantized LLM workloads.
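
If you want to record the kernel version and machine architecture in one line, for example when comparing environments or filing an issue, an optional check is shown below:

```bash
# Print kernel name, kernel release, and hardware architecture;
# the architecture should report aarch64 on DGX Spark.
uname -srm
```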

#### Step 2: Verify Blackwell GPU and Driver

After confirming your CPU configuration, you can verify that the **Blackwell GPU** inside the GB10 Grace–Blackwell Superchip is properly detected and ready for CUDA workloads.

```bash
nvidia-smi
```

Expected output:
```log
Wed Oct 22 09:26:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   32C    P8              4W /  N/A  |     Not Supported      |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3094      G   /usr/lib/xorg/Xorg                       43MiB |
|    0   N/A  N/A            3172      G   /usr/bin/gnome-shell                     16MiB |
+-----------------------------------------------------------------------------------------+
```

The `nvidia-smi` tool not only reports GPU hardware specifications but also provides valuable runtime information — including driver status, temperature, power usage, and GPU utilization — which helps verify that the system is stable and ready for AI workloads.

The table below explains the key fields in the `nvidia-smi` output:

| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
|---------------|--------------------------------------|---------------------------------------------|
| **GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
| **Driver Version** | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
| **CUDA Version** | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
| **Architecture / Compute Capability** | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
| **Memory** | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between the Grace CPU and GPU through a unified inference memory space. |
| **Power & Thermal Status** | ~4 W at idle, 32°C | Confirms the GPU is powered on and thermally stable while idle. |
| **GPU Utilization** | 0% (idle) | Indicates no active compute workloads; the GPU is ready for new inference jobs. |
| **Memory Usage** | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
| **Persistence Mode** | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
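
For scripting or logging, `nvidia-smi` also offers a query mode that prints selected fields in CSV form. A small sketch (fields DGX Spark does not expose, such as some memory metrics, print as [N/A]):

```bash
# Query a few fields in machine-readable CSV; handy for monitoring scripts
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu --format=csv
```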

#### Step 3: Check CUDA Toolkit

To build the CUDA version of llama.cpp, the system must have a valid CUDA toolkit installed.
The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
This ensures that CMake can correctly detect and compile the GPU-accelerated components.

```bash
nvcc --version
```

Expected output:
```log
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
```

{{% notice Note %}}
In this Learning Path, the nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
{{% /notice %}}

This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
If the command is missing or reports an older version (for example, 12.x), update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
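
As an optional smoke test, you can compile and launch a trivial kernel targeted at the Blackwell architecture. This is a minimal sketch, assuming the `sm_121` target reported above and using throwaway paths under /tmp:

```bash
# Write a trivial CUDA program that launches an empty kernel
cat > /tmp/check_sm121.cu <<'EOF'
#include <cstdio>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();        // launch one thread on the GPU
    cudaDeviceSynchronize(); // wait for the kernel to finish
    std::printf("GPU kernel launch OK\n");
    return 0;
}
EOF

# Compile for the Blackwell GPU (sm_121) and run the binary
nvcc -arch=sm_121 -o /tmp/check_sm121 /tmp/check_sm121.cu && /tmp/check_sm121
```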

At this point, you have verified that:
- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions.
- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.

Your DGX Spark environment is now fully prepared for the next session, where you will build and configure both CPU and GPU versions of **llama.cpp**, laying the foundation for running quantized LLMs efficiently on the Grace–Blackwell platform.
