weight: 2
layout: learningpathall
---
## Before you start

Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended before building the RAG solution on llama.cpp.

The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform, or GB10, the name of the NVIDIA Grace-Blackwell Superchip.

## What is RAG?

Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation. Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

Each stage in the RAG pipeline plays a distinct role in transforming a question into a context-aware response (a minimal code sketch follows the list):
* Embedding model: Converts text into dense numerical vectors. An example is e5-base-v2.
* Vector database: Searches for semantically similar chunks. An example is FAISS.
* Language model: Generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct.
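
To make these stages concrete, here is a minimal Python sketch of the embedding and retrieval stages, using the example models named above. The sample chunks and query are illustrative placeholders, not the pipeline you will build in later modules.

```python
# Minimal sketch of the embedding and retrieval stages, assuming the
# example models named above; the chunks and query are placeholders.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Stage 1: the embedding model turns text into dense vectors.
# e5-base-v2 expects "passage: " and "query: " prefixes on its inputs.
embedder = SentenceTransformer("intfloat/e5-base-v2")
chunks = [
    "GB10 pairs an Arm-based Grace CPU with a Blackwell GPU.",
    "Unified memory removes CPU-to-GPU copy overhead.",
]
vectors = embedder.encode(
    [f"passage: {c}" for c in chunks], normalize_embeddings=True
)

# Stage 2: the vector database indexes the vectors for similarity search.
# Inner product over normalized vectors is cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype=np.float32))

# Retrieve the chunk closest to a user question.
query = embedder.encode(
    ["query: What does unified memory do?"], normalize_embeddings=True
)
_, ids = index.search(np.asarray(query, dtype=np.float32), 1)
context = chunks[ids[0][0]]

# Stage 3: a language model would now generate an answer conditioned on
# the retrieved context; that step runs on the llama.cpp server later.
print(f"Retrieved context: {context}")
```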
## Why is Grace–Blackwell good for RAG pipelines?
The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads.

Its unique CPU–GPU design and unified memory enable seamless data exchange, making it an ideal foundation for RAG systems that require both fast document retrieval and high-throughput language model inference.

The GB10 platform includes:
- Grace CPU (Armv9.2 architecture) – 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
- Blackwell GPU – CUDA 13.0 Tensor Core architecture
- Unified memory (128 GB NVLink-C2C) – shared address space between CPU and GPU that allows both processors to access the same 128 GB memory region without copy operations

The GB10 provides the following benefits for RAG applications:
- Hybrid execution – Grace CPU efficiently handles embedding, indexing, and API orchestration.
- Unified memory – Eliminates CPU-to-GPU copy overhead because tensors and document vectors share the same memory region.
- Open-source friendly – Works natively with PyTorch, FAISS, Transformers, and FastAPI.
## RAG system architecture
Here is a diagram of the architecture:
```console
┌─────────────────────────────────────┐
│ User Query │
└──────────────┬──────────────────────┘
               │
               ▼
               ...
```
## Create an engineering assistant
You can use this architecture to create an engineering assistant.

The assistant retrieves technical references from datasheets, programming guides, and application notes, and generates helpful explanations for software developers.

This use case illustrates how a RAG system can provide contextual knowledge without retraining the model.
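
To preview how retrieval and generation connect, the sketch below sends a retrieved context to a llama.cpp server over REST. It assumes a server is already running on localhost port 8080 and exposing the /completion endpoint (llama.cpp server defaults); you will stand this server up in a later module, so treat the values here as placeholders.

```python
# Sketch of the generation call, assuming a llama.cpp server is running
# locally on its default port 8080 with the /completion endpoint.
import requests

# In the real pipeline, `context` comes from the FAISS retrieval step.
context = "Unified memory removes CPU-to-GPU copy overhead."
question = "What does unified memory do on GB10?"

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)

response = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 128, "temperature": 0.2},
    timeout=120,
)
print(response.json()["content"])
```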
The technology stack you will use is listed below:

| Stage | Technology | Hardware | Role |
|-------|------------|----------|------|
| Document Processing | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| Embedding Generation | e5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| Semantic Retrieval | FAISS and LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| Text Generation | llama.cpp REST Server (GGUF model) | Blackwell GPU and Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| Pipeline Orchestration | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |

## Prerequisites check
Before starting, run the following commands to confirm your hardware is ready:
```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"
```
The expected result is:
```output
Architecture: aarch64
```
Print the NVIDIA GPU information:
```bash
# Confirm visible GPU and driver version
nvidia-smi
```
Look for CUDA version 13.0 or later and driver version 580.95.05 or later.
{{% notice Note %}}
If your software versions are lower than the versions mentioned above, you should upgrade before proceeding.
{{% /notice %}}
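
As an optional further check, if you already have a CUDA-enabled PyTorch build installed (environment setup is covered in the next module), you can confirm that the GPU is visible from Python:

```python
# Optional check, assuming a CUDA-enabled PyTorch build is installed.
import torch

print(torch.cuda.is_available())      # Expect: True
print(torch.cuda.get_device_name(0))  # Expect: the Blackwell GPU
```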
## Summary
You now understand how RAG works and why Grace–Blackwell is ideal for RAG systems. The unified memory architecture allows the Grace CPU to handle document retrieval while the Blackwell GPU accelerates text generation, all without data copying overhead.

Next, you'll set up your development environment and install the required tools to build this RAG system.