---
title: Understanding RAG on Grace–Blackwell (GB10)
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is RAG?

This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the ***Grace–Blackwell*** (GB10) platform before you begin building the system in the next steps.

**Retrieval-Augmented Generation (RAG)** combines information retrieval with language-model generation.
Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

Typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response:

* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context.

More information about RAG systems and the challenges of building them can be found in this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/).
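
To make the retrieval half of this pipeline concrete, the sketch below embeds a few passages with E5-base-v2 and searches them with FAISS. This is a minimal illustration rather than code from the upcoming modules: the sample passages and query are invented, and it assumes `sentence-transformers` and `faiss-cpu` are installed.

```python
# Minimal retrieval sketch: embed passages, index them, search by query.
# Assumes: pip install sentence-transformers faiss-cpu
import faiss
from sentence_transformers import SentenceTransformer

# E5 models are trained with "query:" / "passage:" text prefixes.
model = SentenceTransformer("intfloat/e5-base-v2")

passages = [
    "passage: The Grace CPU provides 20 Arm cores for indexing and orchestration.",
    "passage: NVLink-C2C gives the CPU and GPU a shared 128 GB unified memory space.",
]
# Normalized embeddings make inner product equivalent to cosine similarity.
doc_vecs = model.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product index
index.add(doc_vecs)

query_vecs = model.encode(
    ["query: How much unified memory does GB10 provide?"],
    normalize_embeddings=True,
)
scores, ids = index.search(query_vecs, k=1)
print(scores[0][0], passages[ids[0][0]])
```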

## Why Grace–Blackwell (GB10)?

The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads.

Its unique CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems that require both fast document retrieval and high-throughput language model inference.

The GB10 platform integrates:
- ***Grace CPU (Armv9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725)
- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture
- ***Unified Memory (128 GB NVLink-C2C)*** – Shared address space between CPU and GPU. The shared NVLink-C2C interface allows both processors to access the same 128 GB Unified Memory region without copy operations, a key feature validated later in Module 4.

Benefits for RAG:
- ***Hybrid execution*** – Grace CPU efficiently handles embedding, indexing, and API orchestration.
- ***GPU acceleration*** – Blackwell GPU performs token generation with low latency.
- ***Unified memory*** – Eliminates CPU↔GPU copy overhead; tensors and document vectors share the same memory region.
- ***Open-source friendly*** – Works natively with PyTorch, FAISS, Transformers, and FastAPI.

## Conceptual Architecture

```
┌─────────────────────────────────────┐
│             User Query              │
└──────────────┬──────────────────────┘
               │
               ▼
     ┌────────────────────┐
     │   Embedding (E5)   │
     │   → FAISS (CPU)    │
     └────────────────────┘
               │
               ▼
     ┌────────────────────┐
     │  Context Builder   │
     │    (Grace CPU)     │
     └────────────────────┘
               │
               ▼
┌───────────────────────────────────────────────┐
│        llama.cpp (GGUF Model, Q8_0)           │
│        -ngl 40 --ctx-size 8192                │
│  Grace CPU + Blackwell GPU (split compute)    │
└───────────────────────────────────────────────┘
               │
               ▼
     ┌────────────────────┐
     │  FastAPI Response  │
     └────────────────────┘
```
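
For reference, the flags called out in the diagram map onto a llama.cpp server launch along the following lines. This is a sketch rather than the exact command used later in the learning path; the model path, host, and port are placeholder assumptions.

```bash
# Hypothetical llama.cpp REST server launch matching the diagram's flags.
# -ngl 40          offloads 40 transformer layers to the Blackwell GPU
# --ctx-size 8192  reserves an 8192-token window for the prompt plus retrieved context
./llama-server \
  -m models/llama-3.1-8b-instruct-q8_0.gguf \
  -ngl 40 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```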

To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
The assistant retrieves technical references (e.g., a datasheet, programming guide, or application note) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
|-----------|----------------------------|------------------------|--------------|
| **Document Processing** | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| **Embedding Generation** | E5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG query script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls, as sketched below. |
| **Unified Memory Architecture** | Unified LPDDR5X shared memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
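
As an illustration of the orchestration row above, a query script can stitch retrieval and generation together with a single REST call to the llama.cpp server. The `/completion` endpoint is part of llama.cpp's built-in server; the host, port, and retrieved text below are assumptions for the sketch.

```python
# Orchestration sketch: build a grounded prompt from retrieved chunks and
# ask the llama.cpp REST server to generate an answer.
import requests

retrieved_chunks = [  # in the real pipeline these come from the FAISS search
    "NVLink-C2C gives the CPU and GPU a shared 128 GB unified memory space.",
]
question = "How much unified memory does GB10 provide?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question + "\nAnswer:"
)

resp = requests.post(
    "http://localhost:8080/completion",  # assumed llama.cpp server address
    json={"prompt": prompt, "n_predict": 256, "temperature": 0.2},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])
```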

## Prerequisites Check

The following content uses [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a product from [MSI](https://www.msi.com/index.php).

Before proceeding, verify that your GB10 system meets the requirements below.

Run the following commands to confirm your hardware environment:

```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"

# Confirm visible GPU and driver version
nvidia-smi
```

Expected output:
- ***Architecture***: aarch64
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05

{{% notice Note %}}
If your driver or CUDA version is lower than the versions listed above, upgrade the NVIDIA driver before proceeding with the next steps.
{{% /notice %}}

## Wrap-up

In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.

With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and the **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform.

This marks the transition from **theory to practice**: moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell.
