Commit fad6304: End-to-End RAG Pipeline on Grace–Blackwell

7 files changed, 1060 insertions(+), 0 deletions(-)

---
title: Understanding RAG on Grace–Blackwell (GB10)
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is RAG?

This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the ***Grace–Blackwell*** (GB10) platform before you begin building the system in the next steps.

**Retrieval-Augmented Generation (RAG)** combines information retrieval with language-model generation.
Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

Typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context.

For more information about RAG systems and the challenges of building them, see this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/).
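
To make the stages concrete, here is a minimal sketch of the full flow in Python. It assumes `sentence-transformers`, `faiss-cpu`, and `requests` are installed and that a llama.cpp server is already listening on `localhost:8080`; the sample texts and endpoint details are illustrative, and the real pipeline is built step by step in the following modules.

```python
# Minimal RAG flow sketch: embed -> vector search -> prompt -> generate.
# Assumptions: sentence-transformers, faiss-cpu, requests installed;
# llama.cpp server running on localhost:8080 (port is illustrative).
import faiss
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-base-v2")   # runs on the Grace CPU

# Toy corpus; in the real pipeline these are chunked document texts.
chunks = [
    "NVLink-C2C gives the Grace CPU and Blackwell GPU a shared memory space.",
    "FAISS performs nearest-neighbor search over dense vectors.",
]
# E5 expects "passage:" / "query:" prefixes; normalized vectors with an
# inner-product index give cosine similarity.
vecs = embedder.encode(["passage: " + c for c in chunks],
                       normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])   # e5-base-v2 emits 768-dim vectors
index.add(vecs)

query = "How do the CPU and GPU share memory?"
q_vec = embedder.encode(["query: " + query], normalize_embeddings=True)
_, ids = index.search(q_vec, 1)            # top-1 semantic retrieval
context = chunks[ids[0][0]]

# Generation: llama.cpp's REST server exposes a /completion endpoint.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": f"Context: {context}\n\nQuestion: {query}\nAnswer:",
          "n_predict": 128},
)
print(resp.json()["content"])
```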
## Why Grace–Blackwell (GB10)?

The GB10 platform integrates:
- ***Grace CPU (Armv9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725)
- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture
- ***Unified Memory (128 GB NVLink-C2C)*** – Shared address space between CPU and GPU. The NVLink-C2C interconnect lets both processors access the same 128 GB unified memory region without copy operations, a key feature validated later in Module 4.

Benefits for RAG:
- ***Hybrid execution*** – Grace CPU efficiently handles embedding, indexing, and API orchestration.
- ***GPU acceleration*** – Blackwell GPU performs token generation with low latency.
- ***Unified memory*** – Eliminates CPU↔GPU copy overhead; tensors and document vectors share the same memory region.
- ***Open-source friendly*** – Works natively with PyTorch, FAISS, Transformers, and FastAPI.

## Conceptual Architecture

```
┌─────────────────────────────────────┐
│             User Query              │
└──────────────┬──────────────────────┘
               │
               ▼
    ┌────────────────────┐
    │   Embedding (E5)   │
    │   → FAISS (CPU)    │
    └────────────────────┘
               │
               ▼
    ┌────────────────────┐
    │  Context Builder   │
    │    (Grace CPU)     │
    └────────────────────┘
               │
               ▼
┌───────────────────────────────────────────────┐
│         llama.cpp (GGUF Model, Q8_0)          │
│         -ngl 40 --ctx-size 8192               │
│   Grace CPU + Blackwell GPU (split compute)   │
└───────────────────────────────────────────────┘
               │
               ▼
    ┌────────────────────┐
    │  FastAPI Response  │
    └────────────────────┘
```

In the diagram, `-ngl 40` tells llama.cpp to offload 40 model layers to the Blackwell GPU while the remaining layers run on the Grace CPU, and `--ctx-size 8192` sets an 8192-token context window.
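
The "FastAPI Response" stage can be as small as a single endpoint that wraps the retrieval and generation steps. The sketch below is illustrative: the `/ask` route name and the `rag_answer()` helper are hypothetical stand-ins for the orchestration script built later in this learning path.

```python
# Hypothetical serving-layer sketch (assumes fastapi and pydantic installed).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

def rag_answer(question: str) -> str:
    # Placeholder: the real helper would embed the question, search FAISS,
    # build a prompt, and call the llama.cpp server (see the earlier sketch).
    return f"(stub) context-grounded answer to: {question}"

@app.post("/ask")
def ask(q: Query) -> dict:
    return {"answer": rag_answer(q.question)}

# Serve with, for example: uvicorn rag_api:app --host 0.0.0.0 --port 8000
```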

To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
The assistant retrieves technical references (e.g., Arm SDK, TensorRT, or OpenCL documentation) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
|-----------|----------------------------|------------------------|--------------|
| **Document Processing** | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| **Embedding Generation** | E5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural-language responses using the Llama 3.1 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG query script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| **Unified Memory Architecture** | NVLink-C2C shared memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
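
As a taste of the document-processing stage, the snippet below converts a PDF to text with pypdf and applies naive fixed-size chunking. The file name, chunk size, and overlap are illustrative assumptions; later modules define the actual preprocessing used by the pipeline.

```python
# Document-processing sketch (assumes pypdf installed; file name illustrative).
from pypdf import PdfReader

reader = PdfReader("arm_sdk_guide.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap so retrieval hits stay coherent."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk(text)
print(f"{len(chunks)} chunks ready for embedding")
```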
## Prerequisites Check
88+
89+
In the following content, I am using [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a product from [MSI](https://www.msi.com/index.php).
90+
91+
Before proceeding, verify that your GB10 system meets the following:
92+
93+
Run the following commands to confirm your hardware environment:
94+
95+
```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"

# Confirm visible GPU and driver version
nvidia-smi
```

Expected output:
- ***Architecture***: aarch64
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05

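If a CUDA-enabled PyTorch build is already present (installation is covered in the next module), you can also confirm GPU visibility from Python. This is an optional check, not a required step:

```python
# Optional GPU visibility check (assumes PyTorch built with CUDA support).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```
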
## Wrap-up

In this module, you learned the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
You explored how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.

With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
In the next module, you will **prepare the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and the **Llama 3.1 8B Instruct** LLM are functional on the **Grace–Blackwell platform**.

This marks the transition from **theory to practice**: moving from conceptual RAG fundamentals to building your own hybrid CPU–GPU RAG pipeline.
