---
title: Understanding RAG on Grace–Blackwell (GB10)
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is RAG?

This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the ***Grace–Blackwell*** (GB10) platform before you begin building the system in the next modules.

**Retrieval-Augmented Generation (RAG)** combines information retrieval with language-model generation.
Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

A typical pipeline looks like this:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context (see the sketch below).
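
To see how these three components fit together at query time, here is a minimal sketch. It assumes `sentence-transformers` and `faiss` are installed; the sample chunks and the tiny in-memory index are placeholders for the real index you build later in this learning path.

```python
# Minimal query-time RAG sketch (illustrative only).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Embed the user query. E5 models expect a "query: " prefix.
embedder = SentenceTransformer("intfloat/e5-base-v2")
query = "How do I offload model layers to the GPU in llama.cpp?"
query_vec = embedder.encode([f"query: {query}"], normalize_embeddings=True)

# 2. Search a FAISS index of chunk embeddings (a real pipeline loads one from disk).
chunks = ["Use the -ngl flag to offload layers to the GPU.",
          "FAISS stores dense vectors for similarity search."]
chunk_vecs = embedder.encode([f"passage: {c}" for c in chunks],
                             normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])  # inner product on unit vectors = cosine
index.add(np.asarray(chunk_vecs, dtype="float32"))
_, ids = index.search(np.asarray(query_vec, dtype="float32"), 1)

# 3. Build the prompt from the retrieved context; the LLM generates the answer.
context = chunks[ids[0][0]]
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # in the full pipeline this prompt goes to the llama.cpp server
```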

More information about RAG systems and the challenges of building them can be found in this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/).

## Why Grace–Blackwell (GB10)?

The GB10 platform integrates:
- ***Grace CPU (Armv9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725)
- ***Blackwell GPU*** – Tensor Core architecture with CUDA 13.0 support
- ***Unified Memory (128 GB NVLink-C2C)*** – A shared address space between CPU and GPU: NVLink-C2C lets both processors access the same 128 GB memory region without copy operations, a key feature validated later in Module 4.

Benefits for RAG:
- ***Hybrid execution*** – Grace CPU efficiently handles embedding, indexing, and API orchestration.
- ***GPU acceleration*** – Blackwell GPU performs token generation with low latency.
- ***Unified memory*** – Eliminates CPU↔GPU copy overhead; tensors and document vectors share the same memory region.
- ***Open-source friendly*** – Works natively with PyTorch, FAISS, Transformers, and FastAPI.

## Conceptual Architecture

```
        ┌─────────────────────────────────────┐
        │             User Query              │
        └──────────────┬──────────────────────┘
                       │
                       ▼
            ┌────────────────────┐
            │   Embedding (E5)   │
            │   → FAISS (CPU)    │
            └────────────────────┘
                       │
                       ▼
            ┌────────────────────┐
            │  Context Builder   │
            │    (Grace CPU)     │
            └────────────────────┘
                       │
                       ▼
   ┌───────────────────────────────────────────────┐
   │         llama.cpp (GGUF Model, Q8_0)          │
   │         -ngl 40  --ctx-size 8192              │
   │  Grace CPU + Blackwell GPU (split compute)    │
   └───────────────────────────────────────────────┘
                       │
                       ▼
            ┌────────────────────┐
            │  FastAPI Response  │
            └────────────────────┘
```
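
In this architecture, the orchestration script hands the assembled prompt to a running llama.cpp server over REST. The sketch below shows that call, assuming the server was started with the flags from the diagram (for example, `llama-server -m <model>.gguf -ngl 40 --ctx-size 8192 --port 8080`); the port, payload fields, and generation parameters are illustrative assumptions.

```python
# Hedged sketch: send a RAG prompt to a local llama.cpp REST server.
import requests

prompt = "Context:\n<retrieved chunks>\n\nQuestion: What does -ngl 40 do?\nAnswer:"

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp server completion endpoint
    json={"prompt": prompt, "n_predict": 256, "temperature": 0.2},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])  # the generated answer text
```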

To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
The assistant retrieves technical references (e.g., Arm SDK, TensorRT, or OpenCL documentation) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
|-----------|----------------------------|------------------------|--------------|
| **Document Processing** | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| **Embedding Generation** | E5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG query script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| **Unified Memory Architecture** | NVLink-C2C shared memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
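
To connect the first three stages of this table, the hedged sketch below walks a PDF from raw text to a persisted FAISS index. The filename, chunk size, and index path are illustrative assumptions, and the fixed-size splitter stands in for the more careful segmentation a real pipeline would use.

```python
# Hedged ingestion sketch: PDF -> text -> chunks -> embeddings -> FAISS index.
import faiss
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

# Document Processing (Grace CPU): extract raw text from a PDF.
reader = PdfReader("arm_sdk_guide.pdf")  # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Simple fixed-size segmentation; real pipelines often split on sentence boundaries.
chunk_size = 800
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Embedding Generation (Grace CPU): E5 expects a "passage: " prefix.
embedder = SentenceTransformer("intfloat/e5-base-v2")
vecs = embedder.encode([f"passage: {c}" for c in chunks],
                       normalize_embeddings=True)

# Semantic Retrieval: build the index once, persist it for query time.
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))
faiss.write_index(index, "docs.index")  # hypothetical output path
```
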
## Prerequisites Check

The following steps use [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a GB10-based system from [MSI](https://www.msi.com/index.php).

Before proceeding, verify that your GB10 system meets the requirements below.

Run the following commands to confirm your hardware environment:

```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"

# Confirm visible GPU and driver version
nvidia-smi
```

Expected output:
- ***Architecture***: aarch64
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05
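
If a CUDA-enabled PyTorch build is already installed (installation is covered in the next module), you can optionally cross-check the same information from Python. This is a hedged sketch; exact version strings will vary by build.

```python
# Optional cross-check of the GPU environment from Python.
import torch

print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: a Blackwell-class GPU
print(torch.version.cuda)             # expect: the CUDA version PyTorch was built with
```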

## Wrap-up

In this module, you learned the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
You explored how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.

With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
In the next module, you will **prepare the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and the **Llama 3.1 8B Instruct** LLM are functional on the **Grace–Blackwell platform**.

This marks the transition from **theory to practice**: moving from conceptual RAG fundamentals to building your own hybrid CPU–GPU RAG pipeline.