
Commit d8ee52e

Update End-to-End RAG Pipeline on Grace–Blackwell
1 parent 12667b0 commit d8ee52e

File tree

4 files changed: 54 additions & 25 deletions

content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 15 additions & 6 deletions
@@ -17,6 +17,8 @@ Typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

+Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response:
+
* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context.
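To make the stages above concrete, here is a minimal end-to-end sketch in Python. It is not the learning path's final script: the toy corpus and prompt wording are invented, and it assumes a llama.cpp server (set up in later modules) is already listening on localhost:8080.

```python
# Minimal RAG loop: embed -> search -> generate.
# Assumes E5-base-v2 is downloadable and a llama.cpp server runs locally.
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-base-v2")

# Toy corpus standing in for real document chunks
# (E5 expects "passage:"/"query:" prefixes).
chunks = [
    "The CM4 IO board accepts a 12 V power supply.",
    "SWD lets you debug the RP2040 without manual resets.",
]
vecs = embedder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vecs, dtype="float32"))

query = "How do I power the board?"
q = embedder.encode([f"query: {query}"], normalize_embeddings=True)
_, ids = index.search(np.asarray(q, dtype="float32"), 1)
context = chunks[ids[0][0]]

# Generation via llama.cpp's OpenAI-compatible REST endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user",
                        "content": f"Context:\n{context}\n\nQuestion: {query}"}]},
)
print(resp.json()["choices"][0]["message"]["content"])
```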
@@ -26,6 +28,10 @@ More information about RAG systems and the challenges of building them can be found…

## Why Grace–Blackwell (GB10)?

+The Grace–Blackwell (GB10) platform combines an Arm-based Grace CPU with an NVIDIA Blackwell GPU, forming a unified architecture optimized for large-scale AI workloads.
+
+Its CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems, which need both fast document retrieval and high-throughput language model inference.
+
The GB10 platform integrates:
- ***Grace CPU (Arm v9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725)
- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture
@@ -71,7 +77,7 @@ Benefits for RAG:
```

To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
-The assistant retrieves technical references (e.g., Arm SDK, TensorRT, or OpenCL documentation) and generates helpful explanations for software developers.
+The assistant retrieves technical references (e.g., a datasheet, programming guide, or application note) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
@@ -81,7 +87,7 @@ This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST Server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
-| **Unified Memory Architecture** | NVLink-C2C Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
+| **Unified Memory Architecture** | Unified LPDDR5X Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
## Prerequisites Check
@@ -105,13 +111,16 @@ Expected output:
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05

+{{% notice Note %}}
+If your driver or CUDA version is lower than the versions listed above, upgrade the driver before proceeding to the next steps.
+{{% /notice %}}

## Wrap-up

-In this module, you learned the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
-You explored how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.
+In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
+You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.

With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
-In the next module, you will **prepare the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM are functional on the **Grace–Blackwell platform**.
+In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform.

-This marks the transition from **theory to practice** — moving from conceptual RAG fundamentals to building your own hybrid CPU–GPU RAG pipeline.
+This marks the transition from **theory to practice** — moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell.

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_preparation.md renamed to content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md

Lines changed: 27 additions & 14 deletions
@@ -1,10 +1,10 @@
---
-title: Preparing the Environment
+title: Setting Up and Validating the RAG Foundation
weight: 3
layout: "learningpathall"
---

-## Preparing the Environment
+## Setting Up and Validating the RAG Foundation

In the previous module, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.

@@ -27,19 +27,20 @@ source rag-venv/bin/activate

# Upgrade pip and install base dependencies
pip install --upgrade pip
-pip install sentence-transformers faiss-cpu \
-    langchain langchain-community langchain-huggingface \
-    huggingface_hub pypdf cryptography tqdm
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+pip install transformers==4.46.2 sentence-transformers==2.7.0 faiss-cpu langchain==1.0.5 \
+    langchain-community langchain-huggingface huggingface_hub \
+    pypdf tqdm numpy
```

**Why these packages?**
These libraries provide the essential building blocks of the RAG system:
- **sentence-transformers** — used for text embedding with the E5-base-v2 model.
-- **FAISS** — enables efficient similarity search for document retrieval.
+- **faiss-cpu** — enables efficient similarity search for document retrieval. Because retrieval in this pipeline runs on the Grace CPU, the CPU build of FAISS is sufficient; GPU acceleration is not required at this stage.
- **LangChain** — manages data orchestration between embedding, retrieval, and generation.
- **huggingface_hub** — handles model download and authentication.
- **pypdf** — extracts and processes text content from documents.
-- **cryptography** and **tqdm** — provide secure dependencies and progress visualization.
+- **tqdm** — provides progress visualization.

Check installation:
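The diff elides the actual check script, but the hunk below shows part of its expected output (e.g., `FAISS GPU: False`). A minimal check consistent with that output might look like the following sketch; the print labels are assumptions:

```python
# Hypothetical reconstruction of the elided installation check.
import faiss
import langchain
import sentence_transformers

print("sentence-transformers:", sentence_transformers.__version__)
print("LangChain:", langchain.__version__)
# faiss-cpu wheels omit the GPU symbols, so this reports False on CPU-only installs.
print("FAISS GPU:", hasattr(faiss, "StandardGpuResources"))
```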
@@ -59,7 +60,7 @@ FAISS GPU: False

## Step 2 – Model Preparation

-Download and organize the models required for the **GB10 Local RAG Blueprint**:
+Download and organize the models required for the **GB10 Local RAG Pipeline**:

- **LLM (Large Language Model)** — llama-3-8b-instruct for text generation.
- **Embedding Model** — E5-base-v2 for document vectorization.
@@ -77,7 +78,9 @@ hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
```

-Run a short Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings.
+### Verify the **E5-base-v2** model
+
+Run a Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings.

```python
from sentence_transformers import SentenceTransformer
@@ -113,7 +116,18 @@ First vector snippet: [-0.012 -0.0062 -0.0008 -0.0014 0.026 -0.0066 -0.0173
-0.0455]
```

-A successful output confirms that the E5-base-v2 embedding model is functional and ready for use on the Grace CPU.
+Interpreting the E5-base-v2 result:
+
+- ***Test sentences***: The two example sentences confirm that the model can process text input and generate embeddings correctly. If this step succeeds, the model’s tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly.
+- ***Embedding shape (2, 768)***: The two sentences were converted into two 768-dimensional embedding vectors; 768 is the hidden dimension of this model.
+- ***First vector snippet***: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text.
+
+A successful output confirms that the ***E5-base-v2 embedding model*** is functional and ready for use on the Grace CPU.
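As an optional follow-up, not part of this commit, you can sanity-check that the vectors carry meaning by scoring a related query/passage pair; the sentences and local model path below are illustrative:

```python
# Related text should score clearly higher than unrelated text (cosine in [-1, 1]).
import os
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(os.path.expanduser("~/models/e5-base-v2"))
emb = model.encode([
    "query: power supply requirements",
    "passage: The board requires a 5 V / 3 A USB-C power supply.",
])
print("cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```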

+### Verify the **llama-3.1-8B** model
+
+Next, verify the GGUF model.

The **llama.cpp** runtime will be used for text generation.
Please ensure that both the **CPU** and **GPU** builds have been installed by following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/).
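Before wiring the model into the pipeline, a quick smoke test against a running llama.cpp server confirms the GGUF file loads and generates. This sketch assumes llama-server's default port (8080) and the model path from Step 2:

```python
# Assumes a server started with, for example:
#   llama-server -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```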
@@ -390,7 +404,7 @@ results = db.similarity_search(query, k=3)
for i, r in enumerate(results, 1):
    print(f"\nResult {i}")
    print(f"Source: {r.metadata.get('source')}")
-    print(r.page_content[:300], "..."
+    print(r.page_content[:300], "...")

query = "Use SWD debug Raspberry Pi Pico"
results = db.similarity_search(query, k=3)
@@ -450,9 +464,9 @@ The execution of `check_index.py` confirmed that your local ***FAISS vector index*** is working correctly.

You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: ***Raspberry Pi 4 power supply*** and ***Raspberry Pi Pico SWD debugging***.

-For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts.
+- For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts.
-For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`.
+- For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`.
The extracted passages consistently explained how the ***Serial Wire Debug (SWD)*** port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document.

This process validates that your system can perform semantic retrieval on technical documents — a core capability of any RAG application.
@@ -469,7 +483,6 @@ In summary, both semantic queries were successfully answered using your local vector index.
| Orchestration | Python RAG Script | Grace CPU | Pipeline control |
| Unified Memory | NVLink-C2C | Shared | Zero-copy data exchange |
-
At this point, your environment is fully configured and validated.
You have confirmed that the E5-base-v2 embedding model, FAISS index, and Llama 3.1 8B model are all functioning correctly.
content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md

Lines changed: 6 additions & 1 deletion
@@ -6,6 +6,11 @@ layout: "learningpathall"

## Integrating Retrieval and Generation

+In the previous modules, you prepared the environment, validated the ***E5-base-v2*** embedding model, and verified that the ***Llama 3.1 8B Instruct*** model runs successfully on the ***Grace–Blackwell (GB10)*** platform.
+
+In this module, you will bring all of these components together into a complete ***Retrieval-Augmented Generation (RAG)*** workflow.
+This stage connects ***CPU-based retrieval and indexing*** with ***GPU-accelerated language generation***, creating an end-to-end system that answers technical questions from real documentation; the sketch after the list below previews the glue code involved.

Building upon the previous modules, you will now:
- Connect the **E5-base-v2** embedding model and FAISS vector index.
- Integrate the **llama.cpp** REST server for GPU-accelerated inference.
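As a preview of that glue code, here is a sketch of the prompt-assembly step. The function name and prompt wording are illustrative; `page_content` and `metadata` match the LangChain document objects used in the earlier retrieval script:

```python
# Fold the top-k retrieved chunks into a single grounded prompt for the
# llama.cpp server. Illustrative sketch, not the learning path's final script.
def build_prompt(question: str, results: list) -> str:
    context = "\n\n".join(
        f"[{r.metadata.get('source')}]\n{r.page_content}" for r in results
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```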
@@ -179,7 +184,7 @@ Follow the previous [learning path](https://learn.arm.com/learning-paths/laptop…

![image1 CPU–GPU Utilization screenshot](rag_utilization.jpeg "CPU–GPU Utilization")

-The figure illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG** execution.
+The figure above illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG*** execution.
On the left, the GPU utilization graph shows a clear spike reaching ***96%***, indicating that the llama.cpp inference engine is actively generating tokens on the GPU.
Meanwhile, on the right, the htop panel shows multiple Python processes (rag_query_rest.py) running on a single Grace CPU core, maintaining around 93% per-core utilization.

content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md

Lines changed: 6 additions & 4 deletions
@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 60

-who_is_this_for: This learning path teaches how a Retrieval-Augmented Generation (RAG) pipeline operates efficiently in a hybrid CPU–GPU environment on the Grace–Blackwell (GB10) platform. Learners will explore how Arm-based Grace CPUs perform document retrieval and orchestration, while Blackwell GPUs handle language model inference through the open-source llama.cpp REST Server.
+who_is_this_for: This learning path is designed for developers and engineers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline optimized for the Grace–Blackwell (GB10) platform. It is ideal for those interested in how Arm-based Grace CPUs manage local document retrieval and orchestration while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST Server. By the end, learners will understand how to build an efficient hybrid CPU–GPU RAG system that leverages Unified Memory for seamless data sharing between computation layers.

learning_objectives:
- Understand how a RAG system combines document retrieval and language model generation.
@@ -17,6 +17,7 @@ learning_objectives:

prerequisites:
- One NVIDIA DGX Spark system with at least 15 GB of available disk space.
+- Follow the previous [Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to install both the CPU and GPU builds of llama.cpp.

author: Odin Shen

@@ -44,9 +45,10 @@ further_reading:
    link: https://github.com/NVIDIA/dgx-spark-playbooks
    type: documentation
  - resource:
-    title: Arm Blog Post
-    link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations
-    type: Blog
+    title: Arm Learning Path
+    link: https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/
+    type: Learning Path

### FIXED, DO NOT MODIFY
# ================================================================================
