
Commit 9cf563c

committed
Tech review of RAG pipeline on DGX Spark
1 parent 10f122f commit 9cf563c

File tree

6 files changed: +606, -509 lines

content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 61 additions & 50 deletions
weight: 2
layout: learningpathall
---

## Before you start

Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. That background is recommended before you build the RAG solution on top of llama.cpp.

The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform or GB10, the name of the NVIDIA Grace-Blackwell Superchip.

## What is RAG?

Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation.
Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

Here is a typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

Each stage in this pipeline plays a distinct role in transforming a question into a context-aware response:

* Embedding model: Converts text into dense numerical vectors. An example is e5-base-v2.
* Vector database: Searches for semantically similar chunks. An example is FAISS.
* Language model: Generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct.
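
To make the data flow concrete, here is a toy sketch of the pipeline in Python. It is illustrative only: a real build uses e5-base-v2 for the embedding step and FAISS for the search step, while this stand-in uses a simple bag-of-words embedding with brute-force cosine similarity, and the document texts are made up.

```python
# Toy sketch of the pipeline: query -> embed -> vector search -> context.
# A real system replaces embed() with e5-base-v2 and the brute-force
# search with a FAISS index; the generation step is omitted here.
import re
import numpy as np

docs = [
    "The Grace CPU has 20 Arm cores.",
    "FAISS searches dense vectors for semantically similar chunks.",
    "llama.cpp serves GGUF models over a REST API.",
]

# Vocabulary built from the corpus; stands in for a learned embedding space.
vocab = sorted({w for d in docs for w in re.findall(r"\w+", d.lower())})

def embed(text: str) -> np.ndarray:
    """Count vocabulary words in the text, then L2-normalize the vector."""
    words = re.findall(r"\w+", text.lower())
    v = np.array([float(words.count(t)) for t in vocab])
    return v / (np.linalg.norm(v) + 1e-9)

index = np.stack([embed(d) for d in docs])  # stands in for the vector database

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)           # cosine similarity against all docs
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "How many cores does the Grace CPU have?"
context = retrieve(question)[0]
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
print(context)  # prints: The Grace CPU has 20 Arm cores.
```

The prompt built in the last step is what gets sent to the language model in the generation stage.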

## Why is Grace–Blackwell good for RAG pipelines?

The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads.

Its unique CPU–GPU design and unified memory enable seamless data exchange, making it an ideal foundation for RAG systems that require both fast document retrieval and high-throughput language model inference.

The GB10 platform includes:

- Grace CPU (Armv9.2 architecture) – 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
- Blackwell GPU – CUDA 13.0 Tensor Core architecture
- Unified Memory (128 GB NVLink-C2C) – Shared address space between CPU and GPU, allowing both processors to access the same 128 GB memory region without copy operations.
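
You can sanity-check the core layout from the OS; on GB10, `nproc` should report 20. These are generic Linux commands, not GB10-specific tools:

```bash
# Total number of online cores (expect 20 on GB10)
nproc

# Per-cluster CPU model names, to see the Cortex-X925 / Cortex-A725 split
lscpu | grep -i "model name"
```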

The GB10 provides the following benefits for RAG applications:

- Hybrid execution – Grace CPU efficiently handles embedding, indexing, and API orchestration.
- GPU acceleration – Blackwell GPU performs token generation with low latency.
- Unified memory – Eliminates CPU to GPU copy overhead because tensors and document vectors share the same memory region.
- Open-source friendly – Works natively with PyTorch, FAISS, Transformers, and FastAPI.

## RAG system architecture

Here is a diagram of the architecture:

```console
.
┌─────────────────────────────────────┐
             User Query
└──────────────┬──────────────────────┘
@@ -76,51 +82,56 @@ Benefits for RAG:
```

## Create an engineering assistant

You can use this architecture to create an engineering assistant.
You can use this architecture to create an engineering assistant.

The assistant retrieves technical references from datasheets, programming guides, and application notes, and generates helpful explanations for software developers.

This use case illustrates how a RAG system can provide contextual knowledge without retraining the model.

The technology stack you will use is listed below:

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
|------------|-----------------------------|--------------------------|---------------|
| Document Processing | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| Embedding Generation | e5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| Semantic Retrieval | FAISS and LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| Text Generation | llama.cpp REST Server (GGUF model) | Blackwell GPU and Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| Pipeline Orchestration | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
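
The orchestration row above amounts to a small Python script that stitches the stages together over REST. Here is a hedged sketch: the server URL is an assumption for your setup, while the `/completion` endpoint and `content` response field follow the llama.cpp server API.

```python
# Minimal orchestration sketch: build a prompt from retrieved context and
# send it to a llama.cpp server. Assumes llama-server is listening on
# localhost:8080; adjust the URL for your environment.
import json
import urllib.request

def build_prompt(context: str, question: str) -> str:
    """Combine retrieved context and the user question into one prompt."""
    return (
        "Use the context to answer the question.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def generate(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    """POST the prompt to llama.cpp's /completion endpoint, return the text."""
    payload = json.dumps({"prompt": prompt, "n_predict": 128}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Example (requires a running server):
# print(generate(build_prompt("GB10 has 128 GB unified memory.",
#                             "How much memory does GB10 have?")))
```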

## Prerequisites check

Before starting, run the following commands to confirm your hardware is ready:

```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"
```

The expected result is:

```output
Architecture: aarch64
```
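
You can also confirm the architecture with `uname`, which should likewise print `aarch64` on DGX Spark:

```bash
# Print the machine hardware architecture (aarch64 on Arm systems)
uname -m
```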

Print the NVIDIA GPU information:

```bash
# Confirm visible GPU and driver version
nvidia-smi
```

Look for CUDA version 13.0 or later and driver version 580.95.05 or later.
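
If you want to script this check, `sort -V` compares dotted version strings. This is an illustrative sketch; paste the driver version that `nvidia-smi` reports into `installed`:

```bash
# Compare an installed driver version against the required minimum.
required="580.95.05"
installed="580.95.05"   # example value; replace with the version from nvidia-smi
lowest="$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)"
if [ "$lowest" = "$required" ]; then
  echo "driver OK"
else
  echo "driver too old, upgrade required"
fi
```

With the example value shown, this prints `driver OK`.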

{{% notice Note %}}
If your software versions are lower than the ones listed above, upgrade before proceeding.
{{% /notice %}}

## Summary

You now understand how RAG works and why Grace–Blackwell is ideal for RAG systems. The unified memory architecture allows the Grace CPU to handle document retrieval while the Blackwell GPU accelerates text generation, all without data copying overhead.

Next, you'll set up your development environment and install the required tools to build this RAG system.
