Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response:
- ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
- ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
- ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context.
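
To make the hand-off between these stages concrete, here is a minimal, self-contained sketch (the model name is the one used in this learning path; the sample texts and query are illustrative):

```python
# Minimal sketch of the pipeline stages: embed text, index it, search it,
# then assemble the retrieved chunk into a prompt for the language model.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

docs = ["The board accepts a 5 V / 3 A USB-C supply.",
        "SWD provides low-level debug access to the core."]

# Stage 1: the embedding model turns text into dense vectors.
# E5 models expect "query:" / "passage:" prefixes.
embedder = SentenceTransformer("intfloat/e5-base-v2")
doc_vecs = embedder.encode([f"passage: {d}" for d in docs],
                           normalize_embeddings=True)

# Stage 2: the vector database finds semantically similar chunks.
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # cosine via normalized inner product
index.add(np.asarray(doc_vecs, dtype="float32"))
q = embedder.encode(["query: how do I power the board?"],
                    normalize_embeddings=True)
_, ids = index.search(np.asarray(q, dtype="float32"), k=1)

# Stage 3: the language model answers, conditioned on the retrieved context.
prompt = f"Context:\n{docs[ids[0][0]]}\n\nQuestion: How do I power the board?"
print(prompt)  # this prompt is what gets sent to the LLM for generation
```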
## Why Grace–Blackwell (GB10)?
The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads.
Its unique CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems that require both fast document retrieval and high-throughput language model inference.
- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture
To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
The assistant retrieves technical references (e.g., datasheets, programming guides, or application notes) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.
| Component | Technology | Hardware | Description |
|---|---|---|---|
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST Server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| **Unified Memory Architecture** | Unified LPDDR5X Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
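
The orchestration row is worth making concrete: the Python script never touches the GPU directly; it simply sends a prompt to the llama.cpp REST server over HTTP. A hedged sketch, assuming the server's default local address and llama.cpp's native completion endpoint:

```python
# Pipeline orchestration on the Grace CPU: call the llama.cpp REST server,
# which runs the GGUF model on the Blackwell GPU.
import requests

def generate(prompt: str, n_predict: int = 256) -> str:
    resp = requests.post(
        "http://localhost:8080/completion",   # llama.cpp native completion endpoint
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]
```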
## Prerequisites Check
Expected output:
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05
{{% notice Note %}}
If your driver or CUDA version is lower than the versions shown above, it's recommended to upgrade the driver before proceeding to the next steps.
{{% /notice %}}
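
If you prefer to script this check, a small sketch (assuming `nvidia-smi` is on the PATH):

```python
# Query the installed driver version via nvidia-smi's stable query interface.
import subprocess

driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip()
print("Driver version:", driver)  # expect 580.95.05 or later
```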
## Wrap-up
In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.
With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform.
This marks the transition from **theory to practice** — moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell.
**File:** `content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md`
---
title: Setting Up and Validating the RAG Foundation
weight: 3
layout: "learningpathall"
---
## Setting Up and Validating the RAG Foundation
In the previous module, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.
The following libraries provide the essential building blocks of the RAG system:
- **sentence-transformers** — used for text embedding with the E5-base-v2 model.
- **faiss-cpu** — enables efficient similarity search for document retrieval. Since this pipeline runs on the Grace CPU, the CPU version of FAISS is sufficient — GPU acceleration is not required for this stage.
- **LangChain** — manages data orchestration between embedding, retrieval, and generation.
- **huggingface_hub** — handles model download and authentication.
- **pypdf** — extracts and processes text content from documents.
- **tqdm** — provides progress visualization.
Check installation:
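
A minimal sketch of such a check (the path's actual script may differ; on a faiss-cpu build the GPU probe prints False, matching the expected output):

```python
# Confirm the core RAG dependencies import cleanly and report their versions.
import faiss
import langchain
import sentence_transformers

print("FAISS version:", faiss.__version__)
print("LangChain version:", langchain.__version__)
print("Sentence-Transformers version:", sentence_transformers.__version__)
# faiss-cpu ships without GPU symbols, so this reports False:
print("FAISS GPU:", hasattr(faiss, "StandardGpuResources"))
```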
## Step 2 – Model Preparation
Download and organize the models required for the **GB10 Local RAG Pipeline**:
- **LLM (Large Language Model)** — Llama 3.1 8B Instruct (GGUF) for text generation.
- **Embedding Model** — E5-base-v2 for document vectorization.
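
One way to fetch both models is with huggingface_hub. In this sketch the repository IDs, the GGUF filename, and the local paths are assumptions for illustration; use the exact repositories referenced in this learning path:

```python
# Download the embedding model and a Q8_0 GGUF build of the LLM.
from huggingface_hub import snapshot_download, hf_hub_download

# Embedding model: the full repo, loadable by sentence-transformers.
snapshot_download(repo_id="intfloat/e5-base-v2", local_dir="models/e5-base-v2")

# LLM: a single GGUF file for llama.cpp (repo and filename are assumptions).
hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    local_dir="models",
)
```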
Interpret the E5-base-v2 Result:
- ***Test sentences***: The two example sentences are used to confirm that the model can process text input and generate embeddings correctly. If this step succeeds, it means the model’s tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly.
- ***Embedding shape (2, 768)***: The two sentences were converted into two 768-dimensional embedding vectors — 768 is the hidden dimension size of this model.
- ***First vector snippet***: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text.
A successful output confirms that the ***E5-base-v2 embedding model*** is functional and ready for use on the Grace CPU.
### Verify the **llama-3.1-8B** model
Next, you will verify the GGUF model.
The **llama.cpp** runtime will be used for text generation.
Please ensure that both the **CPU** and **GPU** builds have been installed following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/).
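A quick smoke test of the GGUF file can be scripted against the llama-cli binary (the binary name follows a standard llama.cpp install; the model path is an assumption):

```python
# Run a short prompt through llama-cli and print the generated text.
import subprocess

result = subprocess.run(
    ["llama-cli",
     "-m", "models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # path is an assumption
     "-p", "Say hello in five words.",
     "-n", "16",                    # generate at most 16 tokens
     "--no-display-prompt"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```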
You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: ***Raspberry Pi 4 power supply*** and ***Raspberry Pi Pico SWD debugging***.
- For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts.
- For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`.
The extracted passages consistently explained how the ***Serial Wire Debug (SWD)*** port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document.
This process validates that your system can perform semantic retrieval on technical documents — a core capability of any RAG application.
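
The kind of lookup `check_index.py` performs can be sketched as follows (the index path and metadata layout are assumptions; consult the path's actual script):

```python
# Embed a query on the Grace CPU and search the saved FAISS index.
import faiss
from sentence_transformers import SentenceTransformer

index = faiss.read_index("rag_index/faiss.index")          # path is an assumption
embedder = SentenceTransformer("intfloat/e5-base-v2")

q = embedder.encode(["query: raspberry pi 4 power supply"],
                    normalize_embeddings=True)
scores, ids = index.search(q.astype("float32"), k=3)
print(scores[0], ids[0])  # ids map back to source chunks via stored metadata
```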
**File:** `content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md`
## Integrating Retrieval and Generation
In the previous modules, you prepared the environment, validated the ***E5-base-v2*** embedding model, and verified that the ***Llama 3.1 8B*** Instruct model runs successfully on the ***Grace–Blackwell (GB10)*** platform.
In this module, you will bring all components together to build a complete ***Retrieval-Augmented Generation*** (RAG) workflow.
This stage connects the ***CPU-based retrieval and indexing*** with ***GPU-accelerated language generation***, creating an end-to-end system capable of answering technical questions using real documentation data.
Building upon the previous modules, you will now:
- Connect the **E5-base-v2** embedding model and FAISS vector index.
- Integrate the **llama.cpp** REST server for GPU-accelerated inference.
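
Taken together, the flow you are about to build looks roughly like this. This is a hedged sketch: the index path, chunk store, and server URL are assumptions, and `rag_query_rest.py` in this learning path remains the reference implementation:

```python
# End-to-end RAG query: CPU-side retrieval feeds GPU-side generation.
import json
import faiss
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-base-v2")      # runs on the Grace CPU
index = faiss.read_index("rag_index/faiss.index")          # path is an assumption
with open("rag_index/chunks.json") as f:                   # id -> chunk text
    chunks = json.load(f)

def rag_answer(question: str, k: int = 3) -> str:
    q = embedder.encode([f"query: {question}"], normalize_embeddings=True)
    _, ids = index.search(q.astype("float32"), k)          # semantic retrieval
    context = "\n\n".join(chunks[i] for i in ids[0])
    prompt = (f"Answer using only the context below.\n\nContext:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    # llama.cpp REST server performs generation on the Blackwell GPU.
    resp = requests.post("http://localhost:8080/completion",
                         json={"prompt": prompt, "n_predict": 256}, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]

print(rag_answer("How should the Raspberry Pi 4 be powered?"))
```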
The figure above illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG*** execution.
On the left, the GPU utilization graph shows a clear spike reaching ***96%***, indicating that the llama.cpp inference engine is actively generating tokens on the GPU.
Meanwhile, on the right, the htop panel shows multiple Python processes (rag_query_rest.py) running on a single Grace CPU core, maintaining around 93% per-core utilization.
**File:** `content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md`
minutes_to_complete: 60
who_is_this_for: This learning path is designed for developers and engineers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline optimized for the Grace–Blackwell (GB10) platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST Server. By the end, learners will understand how to build an efficient hybrid CPU–GPU RAG system that leverages Unified Memory for seamless data sharing between computation layers.
learning_objectives:
- Understand how a RAG system combines document retrieval and language model generation.
prerequisites:
- One NVIDIA DGX Spark system with at least 15 GB of available disk space.
- Follow the previous [Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to install both the CPU and GPU builds of llama.cpp.