
Commit 038944d

Refine RAG documentation for clarity and completeness; update titles, descriptions, and sections for improved user guidance.
1 parent 1ca3d45 commit 038944d

File tree: 3 files changed, +15 -15 lines


content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md

Lines changed: 8 additions & 7 deletions

@@ -8,24 +8,25 @@ layout: learningpathall
 
 ## Before you start
 
-Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp.
+Complete the [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) Learning Path first to understand how to build and run llama.cpp on both the CPU and GPU. This foundational knowledge is essential before you begin building the RAG solution described here.
 
-The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform or GB10, the name of the NVIDIA Grace-Blackwell Superchip.
+{{% notice Note %}}
+The NVIDIA DGX Spark is also called the Grace–Blackwell platform or GB10, which refers to the NVIDIA Grace–Blackwell Superchip.
+{{% /notice %}}
 
 ## What is RAG?
 
-Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation.
-Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.
+Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation. Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.
 
 Here is a typical pipeline:
 
 User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer
 
 Each stage in this pipeline plays a distinct role in transforming a question into a context-aware response:
 
-* Embedding model: Converts text into dense numerical vectors. An example is e5-base-v2.
-* Vector database: Searches for semantically similar chunks. An example is FAISS.
-* Language model: Generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct.
+* Embedding model: converts text into dense numerical vectors. An example is e5-base-v2.
+* Vector database: searches for semantically similar chunks. An example is FAISS.
+* Language model: generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct.
 
 ## Why is Grace–Blackwell good for RAG pipelines?
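The retrieval stages listed in this hunk (embed, search, assemble context) can be sketched end to end. This is a minimal illustration only: the bag-of-words "embedding" and linear scan below are toy stand-ins for a real embedding model such as e5-base-v2 and a FAISS index, and the generation stage is elided.

```python
# Toy sketch of the RAG retrieval stage: embed -> vector search -> context.
# The "embedding" here is a bag-of-words vector, not a learned model.
import math
from collections import Counter

def embed(text):
    """Map text to a sparse bag-of-words vector (toy embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus chunks most similar to the query (linear scan)."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

corpus = [
    "GB10 combines a Grace CPU with a Blackwell GPU.",
    "FAISS performs fast vector similarity search.",
    "llama.cpp serves quantized models over a REST API.",
]

# The retrieved chunk would be passed as context to the language model.
context = retrieve("Which chip pairs Grace with Blackwell?", corpus)
print(context[0])
```

In the real pipeline, `retrieve` is replaced by an approximate nearest-neighbor search over dense embeddings, and the returned chunks are prepended to the prompt sent to the language model.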

content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md

Lines changed: 6 additions & 2 deletions

@@ -6,7 +6,10 @@ layout: "learningpathall"
 
 ## Observe unified memory performance
 
-In this section, you will observe how the Grace CPU and Blackwell GPU share data through unified memory during RAG execution.
+In this section, you will learn how to monitor unified memory performance and GPU utilization on Grace–Blackwell systems during Retrieval-Augmented Generation (RAG) workloads. By observing real-time system memory and GPU activity, you will verify the zero-copy data sharing and efficient hybrid AI inference enabled by the Grace–Blackwell unified memory architecture.
+
+This hands-on experiment demonstrates how unified memory enables both the Grace CPU and the Blackwell GPU to access the same memory space without data movement, optimizing AI inference performance.
 
 You will start from an idle system state, then progressively launch the model server and run a query, while monitoring both system memory and GPU activity from separate terminals.
 
@@ -21,7 +24,8 @@ Open two terminals on your GB10 system and use them as listed in the table below
 
 You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation.
 
-### Prepare for the experiments
+### Prepare for unified memory observation experiments
 
 Ensure the RAG pipeline is stopped before starting the observation.
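The two observation terminals described in this hunk can be driven with standard Linux tools. The exact commands below are an assumption for illustration, not taken from the Learning Path itself:

```shell
# Terminal 1 (CPU side): snapshot of the unified memory pool as seen by
# the Grace CPU; wrap in `watch -n 1` for a continuously updating view.
free -h

# Terminal 2 (GPU side): utilization and memory in use on the Blackwell
# GPU (requires the NVIDIA driver; `|| true` keeps this a no-op elsewhere).
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv || true
```

Because GB10 uses a single unified memory pool, the totals reported by `free` and the GPU-side usage reported by `nvidia-smi` describe the same physical memory.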

content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md

Lines changed: 1 addition & 6 deletions

@@ -1,16 +1,11 @@
 ---
 title: Build a RAG pipeline on NVIDIA DGX Spark
-
-draft: true
-cascade:
-  draft: true
-
 minutes_to_complete: 60
 
 who_is_this_for: This is an advanced topic for developers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST server.
 
 learning_objectives:
-- Understand how a RAG system combines document retrieval and language model generation.
+- Describe how a RAG system combines document retrieval and language model generation.
 - Deploy a hybrid CPU–GPU RAG pipeline on the GB10 platform using open-source tools.
 - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval.
 - Build a reproducible RAG application that demonstrates efficient hybrid computing.
