Commit 1576f0d

Merge pull request #2576 from madeline-underwood/blackwell
Blackwell_JA to sign off
2 parents adcdd70 + be872df commit 1576f0d

File tree

6 files changed: +46 −48 lines changed

content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 9 additions & 10 deletions

@@ -1,21 +1,20 @@
 ---
-title: Understanding RAG on Grace–Blackwell (GB10)
+title: Explore building a RAG pipeline on Arm-based Grace–Blackwell systems
 weight: 2

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Before you start
+## Get started

-Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp.
+Before getting started, you should complete the Learning Path [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp.

 The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform or GB10, the name of the NVIDIA Grace-Blackwell Superchip.

 ## What is RAG?

-Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation.
-Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.
+Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation. Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

 Here is a typical pipeline:

@@ -35,9 +34,9 @@ Its unique CPU–GPU design and unified memory enable seamless data exchange, ma

 The GB10 platform includes:

-- Grace CPU (Armv9.2 architecture) 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
-- Blackwell GPU CUDA 13.0 Tensor Core architecture
-- Unified Memory (128 GB NVLink-C2C) Shared address space between CPU and GPU which allows both processors to access the same 128 GB unified memory region without copy operations.
+- Grace CPU (Armv9.2 architecture) - 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
+- Blackwell GPU - CUDA 13.0 Tensor Core architecture
+- Unified Memory (128 GB NVLink-C2C) - Shared address space between CPU and GPU which allows both processors to access the same 128 GB unified memory region without copy operations.

 The GB10 provides the following benefits for RAG applications:

@@ -51,7 +50,7 @@ The GB10 provides the following benefits for RAG applications:
 Here is a diagram of the architecture:

 ```console
-.
+.
 ┌─────────────────────────────────────┐
 │             User Query              │
 └──────────────┬──────────────────────┘
@@ -102,7 +101,7 @@ The technology stack you will use is listed below:
 | Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |

-## Prerequisites Check
+## Check your setup

 Before starting, run the following commands to confirm your hardware is ready:
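The retrieve-then-generate flow this page describes can be sketched in a few lines of Python. This is an illustrative toy, not the Learning Path's actual scripts: the corpus, the keyword-overlap scorer, and the prompt template are hypothetical stand-ins for the FAISS retriever and llama.cpp generation used later.

```python
# Toy sketch of the RAG flow: retrieve relevant chunks, then build a
# context-augmented prompt for the language model.

def score(query: str, chunk: str) -> int:
    """Hypothetical relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k best-scoring chunks (FAISS plays this role later)."""
    return sorted(corpus, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Pass retrieved text to the model so answers are grounded in it."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Grace CPU has 20 Arm cores.",
    "The Blackwell GPU accelerates inference.",
    "Unified memory is shared by CPU and GPU.",
]
question = "How many cores does the Grace CPU have?"
chunks = retrieve(question, corpus)
prompt = build_prompt(question, chunks)
print(chunks[0])  # the chunk about the Grace CPU ranks first
```

In the real pipeline, the scoring step is replaced by embedding similarity search and the prompt is sent to the llama.cpp server.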

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md renamed to content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_setup.md

Lines changed: 4 additions & 4 deletions

@@ -1,5 +1,5 @@
 ---
-title: Configure your development environment and prepare models
+title: Configure the RAG development environment and models
 weight: 3
 layout: "learningpathall"
 ---
@@ -80,11 +80,11 @@ hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
 wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
 ```

-### Verify the e5-base-v2 model
+## Verify the e5-base-v2 model

 Run a Python script to verify that the e5-base-v2 model loads correctly and can generate embeddings.

-Save the code below in a text file named `vector-test.py`.
+Save the code below in a text file named `vector-test.py`:

 ```bash
 from sentence_transformers import SentenceTransformer
@@ -136,7 +136,7 @@ The e5-base-v2 results show:

 A successful output confirms that the e5-base-v2 embedding model is functional and ready for use.

-### Verify the Llama 3.1 model
+## Verify the Llama 3.1 model

 The llama.cpp runtime will be used for text generation using the Llama 3.1 model.
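The verification script above depends on `sentence_transformers` and a downloaded model, but the similarity arithmetic it reports can be sketched without either. A minimal, dependency-free cosine-similarity check; the vector values here are invented for illustration (e5-base-v2 actually emits 768-dimensional embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the measure used to compare embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" standing in for real model output.
query_vec = [0.1, 0.3, 0.5, 0.2]
passage_vec = [0.1, 0.3, 0.5, 0.2]  # identical text -> identical vector
other_vec = [0.9, -0.2, 0.0, 0.1]

print(round(cosine_similarity(query_vec, passage_vec), 3))  # 1.0
print(cosine_similarity(query_vec, other_vec) < 1.0)        # True
```

A real run of `vector-test.py` performs the same comparison on model-generated vectors: near-identical texts should score close to 1.0, unrelated texts noticeably lower.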

content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md

Lines changed: 8 additions & 8 deletions

@@ -1,16 +1,16 @@
 ---
-title: Add documents to the vector database
+title: Add documents to the RAG vector database
 weight: 4
 layout: "learningpathall"
 ---

-## Prepare a sample document corpus
+## Prepare a sample document corpus for RAG

 You are now ready to add your documents to the RAG database that will be used for retrieval and reasoning.

 This converts your raw knowledge documents into clean, chunked text segments that can later be vectorized and indexed by FAISS.

-## Understanding FAISS for vector search
+## Use FAISS for efficient vector search on Arm

 FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search and clustering of dense vectors. It's particularly well-suited for RAG applications because it can quickly find the most relevant document chunks from large collections.

@@ -21,7 +21,7 @@ Key advantages of FAISS for this application:
 - Speed: Uses advanced indexing algorithms to perform nearest-neighbor searches in milliseconds
 - Flexibility: Supports multiple distance metrics (L2, cosine similarity) and index types

-### Create a workspace and data folder
+## Set up your RAG workspace and data folder

 Create a directory structure for your data:

@@ -57,7 +57,7 @@ Use `wget` to batch download all the PDFs into `~/rag/pdf`.
 wget -P ~/rag/pdf -i datasheet.txt
 ```

-### Convert PDF into txt file
+## Convert PDF documents to text files

 Then, create a Python file named `pdf2text.py` with the code below:

@@ -109,7 +109,7 @@ At the end of the output you see:
 Total converted PDFs: 12
 ```

-### Verify your corpus
+## Verify your document corpus

 You should now see a number of files in your folder. Run the command below to inspect the results:
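The corpus preparation described above turns raw documents into chunked text segments before vectorization. A minimal word-based chunker conveys the idea; the chunk size and overlap values below are illustrative assumptions, not the Learning Path's actual settings:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks with overlap, so sentences that
    straddle a chunk boundary still appear intact in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# A synthetic 120-word "document" for demonstration.
doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks, each overlapping its neighbor by 10 words
```

Overlap trades a little index size for recall: a fact split across two chunks is still retrievable from at least one of them.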

@@ -119,7 +119,7 @@ find ~/rag/text/ -type f -name "*.txt" -exec cat {} + | wc -l

 It shows the total number of lines, which is around 100,000.

-## Build an Embedding and Search Index
+## Build an embedding and search index with FAISS

 Convert your prepared text corpus into vector embeddings and store them in a FAISS index for efficient semantic search.

@@ -133,7 +133,7 @@ This stage enables your RAG pipeline to retrieve the most relevant text chunks w

 Use e5-base-v2 to encode the documents and create a FAISS vector index.

-### Create the FAISS builder script
+## Create and run the FAISS builder script

 ```bash
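For intuition about what the FAISS builder produces, the core operation of a flat L2 index can be mimicked with an exhaustive search in plain Python. This is only a conceptual stand-in under assumed data: the vectors below are invented, and FAISS performs the same comparison with optimized, vectorized kernels over real embeddings.

```python
def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance, the metric behind a flat L2 index."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Exhaustive nearest-neighbor search: rank every stored vector by
    distance to the query and return the k closest chunk indices."""
    ranked = sorted(range(len(index)), key=lambda i: l2_squared(index[i], query))
    return ranked[:k]

# Invented 3-dimensional embeddings for four document chunks.
index = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.1]

print(search(index, query))  # [1, 2] — the two chunks nearest the query
```

In the pipeline, the indices returned by the search map back to the chunk texts, which are then passed to the language model as context.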

content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md

Lines changed: 5 additions & 5 deletions

@@ -1,10 +1,10 @@
 ---
-title: Implementing the RAG pipeline
+title: Build and run the RAG pipeline
 weight: 5
 layout: "learningpathall"
 ---

-## Integrating retrieval and generation
+## Integrate retrieval and generation on Arm

 In the previous sections, you prepared the environment, validated the e5-base-v2 embedding model, and verified that the Llama 3.1 8B Instruct model runs successfully on the Grace–Blackwell (GB10) platform.

@@ -17,7 +17,7 @@ Building upon the previous modules, you will now:
 - Integrate the llama.cpp REST server for GPU-accelerated inference.
 - Execute a complete Retrieval-Augmented Generation (RAG) workflow for end-to-end question answering.

-### Start the llama.cpp REST server
+## Start the llama.cpp REST server

 Before running the RAG query script, ensure the LLM server is active by running:

@@ -41,7 +41,7 @@ The output is:
 {"status":"ok"}
 ```

-### Create the RAG query script
+## Create the RAG query script

 This script performs the full pipeline using the flow:

@@ -185,7 +185,7 @@ This demonstrates that the RAG system correctly retrieved relevant sources and g
 You can reference section 5.1.2 of the PDF to verify the result.

-### Observe CPU and GPU utilization
+## Observe CPU and GPU utilization

 If you have installed `htop` and `nvtop`, you can observe CPU and GPU utilization.
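The RAG query script talks to the llama.cpp REST server over HTTP. A sketch of how such a request could be assembled with only the standard library; the server address, endpoint path, and parameter names (`prompt`, `n_predict`, `temperature`) follow the llama.cpp server's `/completion` API, but treat them as assumptions and check them against your server version:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address and port

def build_completion_request(prompt: str, n_predict: int = 128) -> urllib.request.Request:
    """Build a POST request for llama.cpp's /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.2}
    return urllib.request.Request(
        SERVER + "/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("Answer using the retrieved context: ...")
print(req.full_url)  # http://127.0.0.1:8080/completion

# With llama-server running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["content"])
```

The full query script wraps this call: it embeds the question, retrieves chunks from FAISS, folds them into the prompt, and posts the result to the server.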

content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md

Lines changed: 13 additions & 9 deletions

@@ -1,12 +1,15 @@
 ---
-title: Observe unified memory performance
+title: Monitor unified memory performance
 weight: 6
 layout: "learningpathall"
 ---

 ## Observe unified memory performance

-In this section, you will observe how the Grace CPU and Blackwell GPU share data through unified memory during RAG execution.
+In this section, you will learn how to monitor unified memory performance and GPU utilization on Grace–Blackwell systems during Retrieval-Augmented Generation (RAG) AI workloads. By observing real-time system memory and GPU activity, you will verify zero-copy data sharing and efficient hybrid AI inference enabled by the Grace–Blackwell unified memory architecture.
+
+You will start from an idle system state, then progressively launch the RAG model server and run a query, while monitoring both system memory and GPU activity from separate terminals. This hands-on experiment demonstrates how unified memory enables both the Grace CPU and Blackwell GPU to access the same memory space without data movement, optimizing AI inference performance.

 You will start from an idle system state, then progressively launch the model server and run a query, while monitoring both system memory and GPU activity from separate terminals.

@@ -21,11 +24,12 @@ Open two terminals on your GB10 system and use them as listed in the table below

 You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation.

-### Prepare for the experiments
+
+## Prepare for unified memory observation

 Ensure the RAG pipeline is stopped before starting the observation.

-#### Terminal 1 - system memory observation
+### Terminal 1: system memory observation

 Run the Bash commands below in terminal 1 to print the free memory of the system:

@@ -52,7 +56,7 @@ The printed fields are:
 - `free` — Memory not currently allocated or reserved by the system.
 - `available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers.

-#### Terminal 2 GPU status observation
+### Terminal 2: GPU status observation

 Run the Bash commands below in terminal 2 to print the GPU statistics:

@@ -85,7 +89,7 @@ Here is an explanation of the fields:
 | `memory.used` | GPU VRAM usage | GB10 does not include separate VRAM; all data resides within Unified Memory |

-### Run the llama-server
+## Run the llama-server

 With the idle condition understood, start the `llama.cpp` REST server again in your original terminal, not the two new terminals being used for observation.

@@ -134,7 +138,7 @@ The output in monitor terminal 2 is similar to:
 This confirms the model is resident in unified memory, which is visible by the increased system RAM usage.

-## Execute the RAG Query
+## Execute the RAG query

 With the observation code and the `llama-server` still running, run the RAG query in another terminal:

@@ -196,7 +200,7 @@ The GPU executes compute kernels with GPU utilization at 96%, without reading fr
 The `utilization.memory=0` and `memory.used=[N/A]` metrics are clear signs that data sharing, not data copying, is happening.

-### Observe and interpret unified memory behavior
+## Interpret unified memory behavior

 This experiment confirms the Grace–Blackwell Unified Memory architecture in action:
 - The CPU and GPU share the same address space.
@@ -207,7 +211,7 @@ Data does not move — computation moves to the data.

 The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, both operating within the same Unified Memory pool.

-### Summary of unified memory behavior
+## Summary of unified memory behavior

 | **Observation** | **Unified Memory Explanation** |
 |----------------------------------------------------|----------------------------------------------------------|
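The before/after memory readings taken in terminal 1 can also be captured programmatically, which makes the model-load delta easy to log. A sketch that parses `/proc/meminfo`-style output the same way `free` does; the sample text below is invented, and on a GB10 you would read the real file with `open("/proc/meminfo")`:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0])
    return fields

# Invented sample values for illustration only.
sample = """MemTotal:       125000000 kB
MemFree:         90000000 kB
MemAvailable:   110000000 kB"""

snapshot = parse_meminfo(sample)
used_kb = snapshot["MemTotal"] - snapshot["MemFree"]
print(used_kb // 1024, "MiB in use")  # compare this before and after model load
```

Taking one snapshot at idle and another after `llama-server` loads the model shows the unified-memory footprint of the model as a rise in system RAM usage, with no separate VRAM counter involved.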

content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md

Lines changed: 7 additions & 12 deletions

@@ -1,22 +1,17 @@
 ---
-title: Build a RAG pipeline on NVIDIA DGX Spark
-
-draft: true
-cascade:
-  draft: true
-
+title: Build a RAG pipeline on Arm-based NVIDIA DGX Spark
 minutes_to_complete: 60

-who_is_this_for: This is an advanced topic for developers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST server.
+who_is_this_for: This is an advanced topic for developers who want to build a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. You'll learn how Arm-based Grace CPUs handle document retrieval and orchestration, while Blackwell GPUs speed up large language model inference using the open-source llama.cpp REST server. This is a great fit if you're interested in combining Arm CPU management with GPU-accelerated AI workloads.

 learning_objectives:
-    - Understand how a RAG system combines document retrieval and language model generation.
-    - Deploy a hybrid CPU–GPU RAG pipeline on the GB10 platform using open-source tools.
-    - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval.
-    - Build a reproducible RAG application that demonstrates efficient hybrid computing.
+    - Describe how a RAG system combines document retrieval and language model generation
+    - Deploy a hybrid CPU-GPU RAG pipeline on the GB10 platform using open-source tools
+    - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval
+    - Build a reproducible RAG application that demonstrates efficient hybrid computing

 prerequisites:
-    - An NVIDIA DGX Spark system with at least 15 GB of available disk space.
+    - An NVIDIA DGX Spark system with at least 15 GB of available disk space

 author: Odin Shen
