Commit 1576f0d

Merge pull request #2576 from madeline-underwood/blackwell
Blackwell_JA to sign off
2 parents adcdd70 + be872df commit 1576f0d

File tree

6 files changed: +46 −48 lines changed

content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 9 additions & 10 deletions

@@ -1,21 +1,20 @@
 ---
-title: Understanding RAG on Grace–Blackwell (GB10)
+title: Explore building a RAG pipeline on Arm-based Grace–Blackwell systems
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Before you start
+## Get started

-Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp.
+Before getting started, you should complete the Learning Path [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp.

 The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform or GB10, the name of the NVIDIA Grace-Blackwell Superchip.

 ## What is RAG?

-Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation.
-Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.
+Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation. Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

 Here is a typical pipeline:

@@ -35,9 +34,9 @@ Its unique CPU–GPU design and unified memory enable seamless data exchange, ma

 The GB10 platform includes:

-- Grace CPU (Armv9.2 architecture) 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
-- Blackwell GPU CUDA 13.0 Tensor Core architecture
-- Unified Memory (128 GB NVLink-C2C) Shared address space between CPU and GPU which allows both processors to access the same 128 GB unified memory region without copy operations.
+- Grace CPU (Armv9.2 architecture) - 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
+- Blackwell GPU - CUDA 13.0 Tensor Core architecture
+- Unified Memory (128 GB NVLink-C2C) - Shared address space between CPU and GPU which allows both processors to access the same 128 GB unified memory region without copy operations.

 The GB10 provides the following benefits for RAG applications:

@@ -51,7 +50,7 @@ The GB10 provides the following benefits for RAG applications:
 Here is a diagram of the architecture:

 ```console
-.
+.
 ┌─────────────────────────────────────┐
 │             User Query              │
 └──────────────┬──────────────────────┘
@@ -102,7 +101,7 @@ The technology stack you will use is listed below:
 | Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |

-## Prerequisites Check
+## Check your setup

 Before starting, run the following commands to confirm your hardware is ready:
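The retrieve-then-generate flow this page describes can be sketched in a few lines of Python. This is an illustrative toy, not the Learning Path's actual scripts: the corpus, the keyword-overlap scorer, and the prompt template are hypothetical stand-ins for the FAISS retriever and llama.cpp generation used later.

```python
# Toy sketch of the RAG flow: retrieve relevant chunks, then build a
# context-augmented prompt for the language model.

def score(query: str, chunk: str) -> int:
    """Hypothetical relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k best-scoring chunks (FAISS plays this role later)."""
    return sorted(corpus, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Pass retrieved text to the model so answers are grounded in it."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Grace CPU has 20 Arm cores.",
    "The Blackwell GPU accelerates inference.",
    "Unified memory is shared by CPU and GPU.",
]
question = "How many cores does the Grace CPU have?"
chunks = retrieve(question, corpus)
prompt = build_prompt(question, chunks)
print(chunks[0])  # the chunk about the Grace CPU ranks first
```

In the real pipeline, the scoring step is replaced by embedding similarity search and the prompt is sent to the llama.cpp server.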

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md renamed to content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_setup.md

Lines changed: 4 additions & 4 deletions

@@ -1,5 +1,5 @@
 ---
-title: Configure your development environment and prepare models
+title: Configure the RAG development environment and models
 weight: 3
 layout: "learningpathall"
 ---
@@ -80,11 +80,11 @@ hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
 wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
 ```

-### Verify the e5-base-v2 model
+## Verify the e5-base-v2 model

 Run a Python script to verify that the e5-base-v2 model loads correctly and can generate embeddings.

-Save the code below in a text file named `vector-test.py`.
+Save the code below in a text file named `vector-test.py`:

 ```bash
 from sentence_transformers import SentenceTransformer
@@ -136,7 +136,7 @@ The e5-base-v2 results show:

 A successful output confirms that the e5-base-v2 embedding model is functional and ready for use.

-### Verify the Llama 3.1 model
+## Verify the Llama 3.1 model

 The llama.cpp runtime will be used for text generation using the Llama 3.1 model.
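The verification script above depends on `sentence_transformers` and a downloaded model, but the similarity arithmetic it reports can be sketched without either. A minimal, dependency-free cosine-similarity check; the vector values here are invented for illustration (e5-base-v2 actually emits 768-dimensional embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the measure used to compare embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" standing in for real model output.
query_vec = [0.1, 0.3, 0.5, 0.2]
passage_vec = [0.1, 0.3, 0.5, 0.2]  # identical text -> identical vector
other_vec = [0.9, -0.2, 0.0, 0.1]

print(round(cosine_similarity(query_vec, passage_vec), 3))  # 1.0
print(cosine_similarity(query_vec, other_vec) < 1.0)        # True
```

A real run of `vector-test.py` performs the same comparison on model-generated vectors: near-identical texts should score close to 1.0, unrelated texts noticeably lower.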

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md

Lines changed: 8 additions & 8 deletions

@@ -1,16 +1,16 @@
 ---
-title: Add documents to the vector database
+title: Add documents to the RAG vector database
 weight: 4
 layout: "learningpathall"
 ---

-## Prepare a sample document corpus
+## Prepare a sample document corpus for RAG

 You are now ready to add your documents to the RAG database that will be used for retrieval and reasoning.

 This converts your raw knowledge documents into clean, chunked text segments that can later be vectorized and indexed by FAISS.

-## Understanding FAISS for vector search
+## Use FAISS for efficient vector search on Arm

 FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search and clustering of dense vectors. It's particularly well-suited for RAG applications because it can quickly find the most relevant document chunks from large collections.

@@ -21,7 +21,7 @@ Key advantages of FAISS for this application:
 - Speed: Uses advanced indexing algorithms to perform nearest-neighbor searches in milliseconds
 - Flexibility: Supports multiple distance metrics (L2, cosine similarity) and index types

-### Create a workspace and data folder
+## Set up your RAG workspace and data folder

 Create a directory structure for your data:

@@ -57,7 +57,7 @@ Use `wget` to batch download all the PDFs into `~/rag/pdf`.
 wget -P ~/rag/pdf -i datasheet.txt
 ```

-### Convert PDF into txt file
+## Convert PDF documents to text files

 Then, create a Python file named `pdf2text.py` with the code below:

@@ -109,7 +109,7 @@ At the end of the output you see:
 Total converted PDFs: 12
 ```

-### Verify your corpus
+## Verify your document corpus

 You should now see a number of files in your folder. Run the command below to inspect the results:
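The corpus preparation described above turns raw documents into chunked text segments before vectorization. A minimal word-based chunker conveys the idea; the chunk size and overlap values below are illustrative assumptions, not the Learning Path's actual settings:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks with overlap, so sentences that
    straddle a chunk boundary still appear intact in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# A synthetic 120-word "document" for demonstration.
doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks, each overlapping its neighbor by 10 words
```

Overlap trades a little index size for recall: a fact split across two chunks is still retrievable from at least one of them.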

@@ -119,7 +119,7 @@ find ~/rag/text/ -type f -name "*.txt" -exec cat {} + | wc -l

 It shows the total number of lines, which is around 100,000.

-## Build an Embedding and Search Index
+## Build an embedding and search index with FAISS

 Convert your prepared text corpus into vector embeddings and store them in a FAISS index for efficient semantic search.

@@ -133,7 +133,7 @@ This stage enables your RAG pipeline to retrieve the most relevant text chunks w

 Use e5-base-v2 to encode the documents and create a FAISS vector index.

-### Create the FAISS builder script
+## Create and run the FAISS builder script

 ```bash
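For intuition about what the FAISS builder produces, the core operation of a flat L2 index can be mimicked with an exhaustive search in plain Python. This is only a conceptual stand-in under assumed data: the vectors below are invented, and FAISS performs the same comparison with optimized, vectorized kernels over real embeddings.

```python
def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance, the metric behind a flat L2 index."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Exhaustive nearest-neighbor search: rank every stored vector by
    distance to the query and return the k closest chunk indices."""
    ranked = sorted(range(len(index)), key=lambda i: l2_squared(index[i], query))
    return ranked[:k]

# Invented 3-dimensional embeddings for four document chunks.
index = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.1]

print(search(index, query))  # [1, 2] — the two chunks nearest the query
```

In the pipeline, the indices returned by the search map back to the chunk texts, which are then passed to the language model as context.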

content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md

Lines changed: 5 additions & 5 deletions

@@ -1,10 +1,10 @@
 ---
-title: Implementing the RAG pipeline
+title: Build and run the RAG pipeline
 weight: 5
 layout: "learningpathall"
 ---

-## Integrating retrieval and generation
+## Integrate retrieval and generation on Arm

 In the previous sections, you prepared the environment, validated the e5-base-v2 embedding model, and verified that the Llama 3.1 8B Instruct model runs successfully on the Grace–Blackwell (GB10) platform.

@@ -17,7 +17,7 @@ Building upon the previous modules, you will now:
 - Integrate the llama.cpp REST server for GPU-accelerated inference.
 - Execute a complete Retrieval-Augmented Generation (RAG) workflow for end-to-end question answering.

-### Start the llama.cpp REST server
+## Start the llama.cpp REST server

 Before running the RAG query script, ensure the LLM server is active by running:

@@ -41,7 +41,7 @@ The output is:
 {"status":"ok"}
 ```

-### Create the RAG query script
+## Create the RAG query script

 This script performs the full pipeline using the flow:

@@ -185,7 +185,7 @@ This demonstrates that the RAG system correctly retrieved relevant sources and g
 You can reference section 5.1.2 of the PDF to verify the result.

-### Observe CPU and GPU utilization
+## Observe CPU and GPU utilization

 If you have installed `htop` and `nvtop`, you can observe CPU and GPU utilization.
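The RAG query script talks to the llama.cpp REST server over HTTP. A sketch of how such a request could be assembled with only the standard library; the server address, endpoint path, and parameter names (`prompt`, `n_predict`, `temperature`) follow the llama.cpp server's `/completion` API, but treat them as assumptions and check them against your server version:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address and port

def build_completion_request(prompt: str, n_predict: int = 128) -> urllib.request.Request:
    """Build a POST request for llama.cpp's /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.2}
    return urllib.request.Request(
        SERVER + "/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("Answer using the retrieved context: ...")
print(req.full_url)  # http://127.0.0.1:8080/completion

# With llama-server running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["content"])
```

The full query script wraps this call: it embeds the question, retrieves chunks from FAISS, folds them into the prompt, and posts the result to the server.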

content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md

Lines changed: 13 additions & 9 deletions

@@ -1,12 +1,15 @@
 ---
-title: Observe unified memory performance
+title: Monitor unified memory performance
 weight: 6
 layout: "learningpathall"
 ---

 ## Observe unified memory performance

-In this section, you will observe how the Grace CPU and Blackwell GPU share data through unified memory during RAG execution.
+In this section, you will learn how to monitor unified memory performance and GPU utilization on Grace–Blackwell systems during Retrieval-Augmented Generation (RAG) AI workloads. By observing real-time system memory and GPU activity, you will verify zero-copy data sharing and efficient hybrid AI inference enabled by the Grace–Blackwell unified memory architecture.
+
+You will start from an idle system state, then progressively launch the RAG model server and run a query, while monitoring both system memory and GPU activity from separate terminals. This hands-on experiment demonstrates how unified memory enables both the Grace CPU and Blackwell GPU to access the same memory space without data movement, optimizing AI inference performance.

 You will start from an idle system state, then progressively launch the model server and run a query, while monitoring both system memory and GPU activity from separate terminals.

@@ -21,11 +24,12 @@ Open two terminals on your GB10 system and use them as listed in the table below

 You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation.

-### Prepare for the experiments
+
+## Prepare for unified memory observation

 Ensure the RAG pipeline is stopped before starting the observation.

-#### Terminal 1 - system memory observation
+### Terminal 1: system memory observation

 Run the Bash commands below in terminal 1 to print the free memory of the system:

@@ -52,7 +56,7 @@ The printed fields are:
 - `free` — Memory not currently allocated or reserved by the system.
 - `available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers.

-#### Terminal 2 GPU status observation
+### Terminal 2: GPU status observation

 Run the Bash commands below in terminal 2 to print the GPU statistics:

@@ -85,7 +89,7 @@ Here is an explanation of the fields:
 | `memory.used` | GPU VRAM usage | GB10 does not include separate VRAM; all data resides within Unified Memory |

-### Run the llama-server
+## Run the llama-server

 With the idle condition understood, start the `llama.cpp` REST server again in your original terminal, not the two new terminals being used for observation.

@@ -134,7 +138,7 @@ The output in monitor terminal 2 is similar to:
 This confirms the model is resident in unified memory, which is visible by the increased system RAM usage.

-## Execute the RAG Query
+## Execute the RAG query

 With the observation code and the `llama-server` still running, run the RAG query in another terminal:

@@ -196,7 +200,7 @@ The GPU executes compute kernels with GPU utilization at 96%, without reading fr
 The `utilization.memory=0` and `memory.used=[N/A]` metrics are clear signs that data sharing, not data copying, is happening.

-### Observe and interpret unified memory behavior
+## Interpret unified memory behavior

 This experiment confirms the Grace–Blackwell Unified Memory architecture in action:
 - The CPU and GPU share the same address space.
@@ -207,7 +211,7 @@ Data does not move — computation moves to the data.

 The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, both operating within the same Unified Memory pool.

-### Summary of unified memory behavior
+## Summary of unified memory behavior

 | **Observation** | **Unified Memory Explanation** |
 |----------------------------------------------------|----------------------------------------------------------|
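The before/after memory readings taken in terminal 1 can also be captured programmatically, which makes the model-load delta easy to log. A sketch that parses `/proc/meminfo`-style output the same way `free` does; the sample text below is invented, and on a GB10 you would read the real file with `open("/proc/meminfo")`:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0])
    return fields

# Invented sample values for illustration only.
sample = """MemTotal:       125000000 kB
MemFree:         90000000 kB
MemAvailable:   110000000 kB"""

snapshot = parse_meminfo(sample)
used_kb = snapshot["MemTotal"] - snapshot["MemFree"]
print(used_kb // 1024, "MiB in use")  # compare this before and after model load
```

Taking one snapshot at idle and another after `llama-server` loads the model shows the unified-memory footprint of the model as a rise in system RAM usage, with no separate VRAM counter involved.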

content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md

Lines changed: 7 additions & 12 deletions

@@ -1,22 +1,17 @@
 ---
-title: Build a RAG pipeline on NVIDIA DGX Spark
-
-draft: true
-cascade:
-  draft: true
-
+title: Build a RAG pipeline on Arm-based NVIDIA DGX Spark
 minutes_to_complete: 60

-who_is_this_for: This is an advanced topic for developers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST server.
+who_is_this_for: This is an advanced topic for developers who want to build a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. You'll learn how Arm-based Grace CPUs handle document retrieval and orchestration, while Blackwell GPUs speed up large language model inference using the open-source llama.cpp REST server. This is a great fit if you're interested in combining Arm CPU management with GPU-accelerated AI workloads.

 learning_objectives:
-    - Understand how a RAG system combines document retrieval and language model generation.
-    - Deploy a hybrid CPU–GPU RAG pipeline on the GB10 platform using open-source tools.
-    - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval.
-    - Build a reproducible RAG application that demonstrates efficient hybrid computing.
+    - Describe how a RAG system combines document retrieval and language model generation
+    - Deploy a hybrid CPU-GPU RAG pipeline on the GB10 platform using open-source tools
+    - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval
+    - Build a reproducible RAG application that demonstrates efficient hybrid computing

 prerequisites:
-    - An NVIDIA DGX Spark system with at least 15 GB of available disk space.
+    - An NVIDIA DGX Spark system with at least 15 GB of available disk space

 author: Odin Shen
