
Commit d8ee52e

Update End-to-End RAG Pipeline on Grace–Blackwell
1 parent 12667b0 commit d8ee52e

File tree

4 files changed: 54 additions & 25 deletions

content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 15 additions & 6 deletions
@@ -17,6 +17,8 @@ Typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

+Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response:
+
* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors.
* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks.
* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context.
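To make the stages above concrete, here is a minimal end-to-end sketch in Python. It is not the learning path's final script: the toy corpus and prompt wording are invented, and it assumes a llama.cpp server (set up in later modules) is already listening on localhost:8080.

```python
# Minimal RAG loop: embed -> search -> generate.
# Assumes E5-base-v2 is downloadable and a llama.cpp server runs locally.
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-base-v2")

# Toy corpus standing in for real document chunks
# (E5 expects "passage:"/"query:" prefixes).
chunks = [
    "The CM4 IO board accepts a 12 V power supply.",
    "SWD lets you debug the RP2040 without manual resets.",
]
vecs = embedder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vecs, dtype="float32"))

query = "How do I power the board?"
q = embedder.encode([f"query: {query}"], normalize_embeddings=True)
_, ids = index.search(np.asarray(q, dtype="float32"), 1)
context = chunks[ids[0][0]]

# Generation via llama.cpp's OpenAI-compatible REST endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user",
                        "content": f"Context:\n{context}\n\nQuestion: {query}"}]},
)
print(resp.json()["choices"][0]["message"]["content"])
```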
@@ -26,6 +28,10 @@ More information about RAG systems and the challenges of building them can be found…

## Why Grace–Blackwell (GB10)?

+The Grace–Blackwell (GB10) platform combines an Arm-based Grace CPU with an NVIDIA Blackwell GPU, forming a unified architecture optimized for large-scale AI workloads.
+
+Its CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems, which need both fast document retrieval and high-throughput language model inference.
+
The GB10 platform integrates:
- ***Grace CPU (Arm v9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725)
- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture
@@ -71,7 +77,7 @@ Benefits for RAG:
```

To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example.
-The assistant retrieves technical references (e.g., Arm SDK, TensorRT, or OpenCL documentation) and generates helpful explanations for software developers.
+The assistant retrieves technical references (e.g., a datasheet, programming guide, or application note) and generates helpful explanations for software developers.
This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
@@ -81,7 +87,7 @@ This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model.
| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| **Text Generation** | llama.cpp REST Server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| **Pipeline Orchestration** | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
-| **Unified Memory Architecture** | NVLink-C2C Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
+| **Unified Memory Architecture** | Unified LPDDR5X Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
## Prerequisites Check
@@ -105,13 +111,16 @@ Expected output:
- ***CUDA Version***: 13.0 (or later)
- ***Driver Version***: 580.95.05

+{{% notice Note %}}
+If your driver or CUDA version is lower than the versions listed above, upgrade the driver before proceeding to the next steps.
+{{% /notice %}}

## Wrap-up

-In this module, you learned the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
-You explored how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.
+In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture.
+You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads.

With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation.
-In the next module, you will **prepare the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM are functional on the **Grace–Blackwell platform**.
+In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform.

-This marks the transition from **theory to practice** — moving from conceptual RAG fundamentals to building your own hybrid CPU–GPU RAG pipeline.
+This marks the transition from **theory to practice** — moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell.

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_preparation.md renamed to content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md

Lines changed: 27 additions & 14 deletions
@@ -1,10 +1,10 @@
---
-title: Preparing the Environment
+title: Setting Up and Validating the RAG Foundation
weight: 3
layout: "learningpathall"
---

-## Preparing the Environment
+## Setting Up and Validating the RAG Foundation

In the previous module, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.

@@ -27,19 +27,20 @@ source rag-venv/bin/activate

# Upgrade pip and install base dependencies
pip install --upgrade pip
-pip install sentence-transformers faiss-cpu \
-    langchain langchain-community langchain-huggingface \
-    huggingface_hub pypdf cryptography tqdm
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+pip install transformers==4.46.2 sentence-transformers==2.7.0 faiss-cpu langchain==1.0.5 \
+    langchain-community langchain-huggingface huggingface_hub \
+    pypdf tqdm numpy
```

**Why these packages?**
These libraries provide the essential building blocks of the RAG system:
- **sentence-transformers** — used for text embedding with the E5-base-v2 model.
-- **FAISS** — enables efficient similarity search for document retrieval.
+- **faiss-cpu** — enables efficient similarity search for document retrieval. Because retrieval in this pipeline runs on the Grace CPU, the CPU build of FAISS is sufficient; GPU acceleration is not required at this stage.
- **LangChain** — manages data orchestration between embedding, retrieval, and generation.
- **huggingface_hub** — handles model download and authentication.
- **pypdf** — extracts and processes text content from documents.
-- **cryptography** and **tqdm** — provide secure dependencies and progress visualization.
+- **tqdm** — provides progress visualization.

Check installation:
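The diff elides the actual check script, but the hunk below shows part of its expected output (e.g., `FAISS GPU: False`). A minimal check consistent with that output might look like the following sketch; the print labels are assumptions:

```python
# Hypothetical reconstruction of the elided installation check.
import faiss
import langchain
import sentence_transformers

print("sentence-transformers:", sentence_transformers.__version__)
print("LangChain:", langchain.__version__)
# faiss-cpu wheels omit the GPU symbols, so this reports False on CPU-only installs.
print("FAISS GPU:", hasattr(faiss, "StandardGpuResources"))
```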
@@ -59,7 +60,7 @@ FAISS GPU: False

## Step 2 – Model Preparation

-Download and organize the models required for the **GB10 Local RAG Blueprint**:
+Download and organize the models required for the **GB10 Local RAG Pipeline**:

- **LLM (Large Language Model)** — llama-3-8b-instruct for text generation.
- **Embedding Model** — E5-base-v2 for document vectorization.
@@ -77,7 +78,9 @@ hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
```

-Run a short Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings.
+### Verify the **E5-base-v2** model
+
+Run a Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings.

```python
from sentence_transformers import SentenceTransformer
@@ -113,7 +116,18 @@ First vector snippet: [-0.012 -0.0062 -0.0008 -0.0014 0.026 -0.0066 -0.0173
-0.0455]
```

-A successful output confirms that the E5-base-v2 embedding model is functional and ready for use on the Grace CPU.
+Interpreting the E5-base-v2 result:
+
+- ***Test sentences***: The two example sentences confirm that the model can process text input and generate embeddings correctly. If this step succeeds, the model’s tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly.
+- ***Embedding shape (2, 768)***: The two sentences were converted into two 768-dimensional embedding vectors; 768 is the hidden dimension of this model.
+- ***First vector snippet***: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text.
+
+A successful output confirms that the ***E5-base-v2 embedding model*** is functional and ready for use on the Grace CPU.
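As an optional follow-up, not part of this commit, you can sanity-check that the vectors carry meaning by scoring a related query/passage pair; the sentences and local model path below are illustrative:

```python
# Related text should score clearly higher than unrelated text (cosine in [-1, 1]).
import os
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(os.path.expanduser("~/models/e5-base-v2"))
emb = model.encode([
    "query: power supply requirements",
    "passage: The board requires a 5 V / 3 A USB-C power supply.",
])
print("cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```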

+### Verify the **llama-3.1-8B** model
+
+Next, verify the GGUF model.

The **llama.cpp** runtime will be used for text generation.
Please ensure that both the **CPU** and **GPU** builds have been installed by following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/).
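Before wiring the model into the pipeline, a quick smoke test against a running llama.cpp server confirms the GGUF file loads and generates. This sketch assumes llama-server's default port (8080) and the model path from Step 2:

```python
# Assumes a server started with, for example:
#   llama-server -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```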
@@ -390,7 +404,7 @@ results = db.similarity_search(query, k=3)
for i, r in enumerate(results, 1):
    print(f"\nResult {i}")
    print(f"Source: {r.metadata.get('source')}")
-    print(r.page_content[:300], "..."
+    print(r.page_content[:300], "...")

query = "Use SWD debug Raspberry Pi Pico"
results = db.similarity_search(query, k=3)
@@ -450,9 +464,9 @@ The execution of `check_index.py` confirmed that your local ***FAISS vector index*** is working correctly.

You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: ***Raspberry Pi 4 power supply*** and ***Raspberry Pi Pico SWD debugging***.

-For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts.
+- For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts.
-For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`.
+- For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`.
The extracted passages consistently explained how the ***Serial Wire Debug (SWD)*** port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document.

This process validates that your system can perform semantic retrieval on technical documents — a core capability of any RAG application.
@@ -469,7 +483,6 @@ In summary, both semantic queries were successfully answered using your local vector index.
| Orchestration | Python RAG Script | Grace CPU | Pipeline control |
| Unified Memory | NVLink-C2C | Shared | Zero-copy data exchange |
-
At this point, your environment is fully configured and validated.
You have confirmed that the E5-base-v2 embedding model, FAISS index, and Llama 3.1 8B model are all functioning correctly.
content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md

Lines changed: 6 additions & 1 deletion
@@ -6,6 +6,11 @@ layout: "learningpathall"

## Integrating Retrieval and Generation

+In the previous modules, you prepared the environment, validated the ***E5-base-v2*** embedding model, and verified that the ***Llama 3.1 8B Instruct*** model runs successfully on the ***Grace–Blackwell (GB10)*** platform.
+
+In this module, you will bring all of these components together into a complete ***Retrieval-Augmented Generation (RAG)*** workflow.
+This stage connects ***CPU-based retrieval and indexing*** with ***GPU-accelerated language generation***, creating an end-to-end system that answers technical questions from real documentation; the sketch after the list below previews the glue code involved.

Building upon the previous modules, you will now:
- Connect the **E5-base-v2** embedding model and FAISS vector index.
- Integrate the **llama.cpp** REST server for GPU-accelerated inference.
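As a preview of that glue code, here is a sketch of the prompt-assembly step. The function name and prompt wording are illustrative; `page_content` and `metadata` match the LangChain document objects used in the earlier retrieval script:

```python
# Fold the top-k retrieved chunks into a single grounded prompt for the
# llama.cpp server. Illustrative sketch, not the learning path's final script.
def build_prompt(question: str, results: list) -> str:
    context = "\n\n".join(
        f"[{r.metadata.get('source')}]\n{r.page_content}" for r in results
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```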
@@ -179,7 +184,7 @@ Follow the previous [learning path](https://learn.arm.com/learning-paths/laptop…

![image1 CPU–GPU Utilization screenshot](rag_utilization.jpeg "CPU–GPU Utilization")

-The figure illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG** execution.
+The figure above illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG*** execution.
On the left, the GPU utilization graph shows a clear spike reaching ***96%***, indicating that the llama.cpp inference engine is actively generating tokens on the GPU.
Meanwhile, on the right, the htop panel shows multiple Python processes (rag_query_rest.py) running on a single Grace CPU core, maintaining around 93% per-core utilization.

content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md

Lines changed: 6 additions & 4 deletions
@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 60

-who_is_this_for: This learning path teaches how a Retrieval-Augmented Generation (RAG) pipeline operates efficiently in a hybrid CPU–GPU environment on the Grace–Blackwell (GB10) platform. Learners will explore how Arm-based Grace CPUs perform document retrieval and orchestration, while Blackwell GPUs handle language model inference through the open-source llama.cpp REST Server.
+who_is_this_for: This learning path is designed for developers and engineers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline optimized for the Grace–Blackwell (GB10) platform. It is ideal for those interested in how Arm-based Grace CPUs manage local document retrieval and orchestration while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST Server. By the end, learners will understand how to build an efficient hybrid CPU–GPU RAG system that leverages Unified Memory for seamless data sharing between computation layers.

learning_objectives:
- Understand how a RAG system combines document retrieval and language model generation.
@@ -17,6 +17,7 @@ learning_objectives:

prerequisites:
- One NVIDIA DGX Spark system with at least 15 GB of available disk space.
+- Follow the previous [Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to install both the CPU and GPU builds of llama.cpp.

author: Odin Shen

@@ -44,9 +45,10 @@ further_reading:
    link: https://github.com/NVIDIA/dgx-spark-playbooks
    type: documentation
  - resource:
-    title: Arm Blog Post
-    link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations
-    type: Blog
+    title: Arm Learning Path
+    link: https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/
+    type: Learning Path

### FIXED, DO NOT MODIFY
# ================================================================================
