weight: 2
layout: learningpathall
---

## Before you start

Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. That background is recommended before you build the RAG solution on llama.cpp.

The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform, or GB10, after the NVIDIA Grace-Blackwell Superchip it is built on.

This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the Grace–Blackwell (GB10) platform before you begin building the system in the next steps.

## What is RAG?

Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation.
Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses.

Here is a typical pipeline:

User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer

Each stage in this pipeline plays a distinct role in transforming a question into a context-aware response:

* Embedding model: Converts text into dense numerical vectors. An example is e5-base-v2.
* Vector database: Searches for semantically similar chunks. An example is FAISS.
* Language model: Generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct.

More information about RAG systems and the challenges of building them can be found in this [Learning Path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/).
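
To make the three stages concrete, here is a minimal sketch in Python. It is illustrative rather than part of this Learning Path's code: the sample documents are placeholders, and it assumes the sentence-transformers, faiss-cpu, and requests packages are installed and that a llama.cpp server is already listening on localhost:8080.

```python
# Minimal RAG sketch: embed, retrieve, generate.
# Assumes a llama.cpp server is already running on localhost:8080 (placeholder).
import numpy as np
import requests
import faiss
from sentence_transformers import SentenceTransformer

# 1. Embedding model: e5-base-v2 expects "passage:" / "query:" prefixes
embedder = SentenceTransformer("intfloat/e5-base-v2")
docs = [
    "GB10 pairs a Grace CPU with a Blackwell GPU.",
    "Unified memory removes CPU-GPU copy overhead.",
]
doc_vecs = embedder.encode([f"passage: {d}" for d in docs],
                           normalize_embeddings=True)

# 2. Vector database: inner product on normalized vectors = cosine similarity
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype="float32"))

query = "What removes copy overhead on GB10?"
q_vec = embedder.encode([f"query: {query}"], normalize_embeddings=True)
_, ids = index.search(np.asarray(q_vec, dtype="float32"), 1)
context = docs[ids[0][0]]

# 3. Language model: condition the answer on the retrieved context
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": f"Context: {context}\nQuestion: {query}\nAnswer:",
          "n_predict": 128},
)
print(resp.json()["content"])
```
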
## Why is Grace–Blackwell good for RAG pipelines?

The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads.

Its unique CPU–GPU design and unified memory enable seamless data exchange, making it an ideal foundation for RAG systems that require both fast document retrieval and high-throughput language model inference.

The GB10 platform includes:

- Grace CPU (Armv9.2 architecture) – 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores
- Blackwell GPU – CUDA 13.0 Tensor Core architecture
- Unified memory (128 GB NVLink-C2C) – A shared address space that allows both the CPU and GPU to access the same 128 GB memory region without copy operations.

The GB10 provides the following benefits for RAG applications:

- Hybrid execution – The Grace CPU efficiently handles embedding, indexing, and API orchestration.
- GPU acceleration – The Blackwell GPU performs token generation with low latency.
- Unified memory – Eliminates CPU-to-GPU copy overhead because tensors and document vectors share the same memory region.
- Open-source friendly – Works natively with PyTorch, FAISS, Transformers, and FastAPI.

## RAG system architecture

Here is a diagram of the architecture:

```console
┌─────────────────────────────────────┐
│             User Query              │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│    Embedding model (e5-base-v2)     │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│        Vector search (FAISS)        │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Generation (llama.cpp / Llama 3.1)  │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│               Answer                │
└─────────────────────────────────────┘

```

## Create an engineering assistant

You can use this architecture to create an engineering assistant.

The assistant retrieves technical references from datasheets, programming guides, and application notes, and generates helpful explanations for software developers.

This use case illustrates how a RAG system can provide contextual knowledge without retraining the model.

The technology stack you will use is listed below:

| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** |
|------------|-----------------------------|--------------------------|---------------|
| Document Processing | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. |
| Embedding Generation | e5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. |
| Semantic Retrieval | FAISS and LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. |
| Text Generation | llama.cpp REST Server (GGUF model) | Blackwell GPU and Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. |
| Pipeline Orchestration | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. |
| Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. |
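
As a preview of the document-processing stage in the table, the sketch below shows one way to turn a PDF into overlapping text chunks with pypdf. The file name, chunk size, and overlap values are illustrative placeholders rather than values prescribed by this Learning Path.

```python
# Sketch of the document-processing stage: extract text from a PDF
# with pypdf and split it into overlapping chunks ready for embedding.
from pypdf import PdfReader

def pdf_to_chunks(path: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    reader = PdfReader(path)
    # Join the text of all pages; extract_text() can return None for empty pages
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # overlap preserves context across chunks
    return chunks

# "datasheet.pdf" is a placeholder file name
chunks = pdf_to_chunks("datasheet.pdf")
print(f"{len(chunks)} chunks ready for embedding")
```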


## Prerequisites check

This Learning Path uses [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a product from [MSI](https://www.msi.com/index.php).

Before starting, run the following commands to confirm your hardware is ready:

```bash
# Check Arm CPU architecture
lscpu | grep "Architecture"
```

The expected result is:

```output
Architecture: aarch64
```

Print the NVIDIA GPU information:

```bash
# Confirm visible GPU and driver version
nvidia-smi
```

Look for CUDA version 13.0 or later and driver version 580.95.05 or later.
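
If you prefer a scripted check, the optional snippet below verifies the same prerequisites from Python. It assumes PyTorch with CUDA support is installed; if it is not, skip this and rely on the commands above.

```python
# Optional scripted prerequisite check (assumes PyTorch with CUDA is installed)
import platform
import torch

# The Grace CPU should report the Arm aarch64 architecture
assert platform.machine() == "aarch64", "expected an aarch64 (Arm) CPU"

# The Blackwell GPU should be visible to CUDA
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)
```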

{{% notice Note %}}
If your software versions are lower than the versions mentioned above, you should upgrade before proceeding.
{{% /notice %}}

## Summary

You now understand how RAG works and why Grace–Blackwell is ideal for RAG systems. The unified memory architecture allows the Grace CPU to handle document retrieval while the Blackwell GPU accelerates text generation, all without data copying overhead.

Next, you'll set up your development environment and install the required tools to build this RAG system.