diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md index 598baad28..d9aedfeef 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md @@ -6,48 +6,54 @@ weight: 2 layout: learningpathall --- -## What is RAG? +## Before you start + +Before starting this Learning Path, you should complete [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to learn about the CPU and GPU builds of llama.cpp. This background is recommended for building the RAG solution on llama.cpp. + +The NVIDIA DGX Spark is also referred to as the Grace-Blackwell platform or GB10, the name of the NVIDIA Grace-Blackwell Superchip. -This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the ***Grace–Blackwell*** (GB10) platform before you begin building the system in the next steps. +## What is RAG? -**Retrieval-Augmented Generation (RAG)** combines information retrieval with language-model generation. +Retrieval-Augmented Generation (RAG) combines information retrieval with language-model generation. Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses. -Typical pipeline: +Here is a typical pipeline: User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer -Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response: +Each stage in this pipeline plays a distinct role in transforming a question into a context-aware response: -* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors. -* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks. -* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context. +* Embedding model: Converts text into dense numerical vectors. An example is e5-base-v2. +* Vector database: Searches for semantically similar chunks. An example is FAISS. +* Language model: Generates an answer conditioned on retrieved context. An example is Llama 3.1 8B Instruct. -More information about RAG system and the challenges of building them can be found in this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/) +## Why is Grace–Blackwell good for RAG pipelines? +The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads. -## Why Grace–Blackwell (GB10)? +Its unique CPU–GPU design and unified memory enable seamless data exchange, making it an ideal foundation for RAG systems that require both fast document retrieval and high-throughput language model inference. -The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads. +The GB10 platform includes: -Its unique CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems that require both fast document retrieval and high-throughput language model inference. 
+- Grace CPU (Armv9.2 architecture) – 20 cores including 10 Cortex-X925 cores and 10 Cortex-A725 cores +- Blackwell GPU – CUDA 13.0 Tensor Core architecture +- Unified Memory (128 GB NVLink-C2C) – Shared address space between CPU and GPU which allows both processors to access the same 128 GB unified memory region without copy operations. -The GB10 platform integrates: -- ***Grace CPU (Arm v9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725) -- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture -- ***Unified Memory (128 GB NVLink-C2C)*** – Shared address space between CPU and GPU. The shared NVLink-C2C interface allows both processors to access the same 128 GB Unified Memory region without copy operations — a key feature validated later in Module 4. +The GB10 provides the following benefits for RAG applications: -Benefits for RAG: -- ***Hybrid execution*** – Grace CPU efficiently handles embedding, indexing, and API orchestration. -- ***GPU acceleration*** – Blackwell GPU performs token generation with low latency. -- ***Unified memory*** – Eliminates CPU↔GPU copy overhead; tensors and document vectors share the same memory region. -- ***Open-source friendly*** – Works natively with PyTorch, FAISS, Transformers, and FastAPI. +- Hybrid execution – Grace CPU efficiently handles embedding, indexing, and API orchestration. +- GPU acceleration – Blackwell GPU performs token generation with low latency. +- Unified memory – Eliminates CPU to GPU copy overhead because tensors and document vectors share the same memory region. +- Open-source friendly – Works natively with PyTorch, FAISS, Transformers, and FastAPI. -## Conceptual Architecture +## RAG system architecture -``` +Here is a diagram of the architecture: + +```console +. ┌─────────────────────────────────────┐ - │ User Query │ + │ User Query │ └──────────────┬──────────────────────┘ │ ▼ @@ -76,51 +82,56 @@ Benefits for RAG: ``` -To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example. -The assistant retrieves technical references (e.g., datasheet, programming guide or application note) and generates helpful explanations for software developers. -This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model. +## Create an engineering assistant + +You can use this architecture to create an engineering assistant. + +The assistant retrieves technical references from datasheets, programming guides, and application notes and and generates helpful explanations for software developers. + +This use case illustrates how a RAG system can provide contextual knowledge without retraining the model. + +The technology stack you will use is listed below: | **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** | |------------|-----------------------------|--------------------------|---------------| -| **Document Processing** | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. | -| **Embedding Generation** | E5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. | -| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. 
| -| **Text Generation** | llama.cpp REST Server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. | -| **Pipeline Orchestration** | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. | -| **Unified Memory Architecture** | Unified LPDDR5X Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. | +| Document Processing | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. | +| Embedding Generation | e5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. | +| Semantic Retrieval | FAISS and LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. | +| Text Generation | llama.cpp REST Server (GGUF model) | Blackwell GPU and Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. | +| Pipeline Orchestration | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. | +| Unified Memory Architecture | Unified LPDDR5X shared memory | Grace CPU and Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. | ## Prerequisites Check -In the following content, I am using [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a product from [MSI](https://www.msi.com/index.php). - -Before proceeding, verify that your GB10 system meets the following: - -Run the following commands to confirm your hardware environment: +Before starting, run the following commands to confirm your hardware is ready: ```bash # Check Arm CPU architecture lscpu | grep "Architecture" +``` + +The expected result is: +```output +Architecture: aarch64 +``` + +Print the NVIDIA GPU information: + +```bash # Confirm visible GPU and driver version nvidia-smi ``` -Expected output: -- ***Architecture***: aarch64 -- ***CUDA Version***: 13.0 (or later) -- ***Driver Version***: 580.95.05 +Look for CUDA version 13.0 or later and Driver version 580.95.05 or later. {{% notice Note %}} -If your software version is lower than the one mentioned above, it’s recommended to upgrade the driver before proceeding with the next steps. +If your software versions are lower than the versions mentioned above, you should upgrade before proceeding. {{% /notice %}} -## Wrap-up - -In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture. -You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads. +## Summary -With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation. -In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform. +You now understand how RAG works and why Grace–Blackwell is ideal for RAG systems. 
The unified memory architecture allows the Grace CPU to handle document retrieval while the Blackwell GPU accelerates text generation, all without data copying overhead. -This marks the transition from **theory to practice** — moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell. \ No newline at end of file +Next, you'll set up your development environment and install the required tools to build this RAG system. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md index 13b660c40..4be023b5b 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md @@ -1,31 +1,26 @@ --- -title: Setting Up and Validating the RAG Foundation +title: Configure your development environment and prepare models weight: 3 layout: "learningpathall" --- -## Setting Up and Validating the RAG Foundation +## Create the development environment -In the previous session, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. +To get started, you need to set up your development environment and prepare the embedding model and the LLM you will use in the RAG pipeline. -This module prepares the software and data foundation that enables the RAG workflow in later stages. +The embedding model for the solution is e5-base-v2, and the LLM is Llama 3.1 8B Instruct. -In this module, you will: -- Set up and validate the core environment for the RAG pipeline. -- Load and test the **E5-base-v2** embedding model. -- Build a local **FAISS** index for document retrieval. -- Prepare and verify the **Llama 3.1 8B Instruct** model for text generation. -- Confirm GPU acceleration and overall system readiness. - -## Step 1 - Create The Development Environment +First, create a Python virtual environment to use for the project: ```bash -# Create and activate a virtual environment cd ~ python3 -m venv rag-venv source rag-venv/bin/activate +``` -# Upgrade pip and install base dependencies +Next, install the required packages: + +```bash pip install --upgrade pip pip install torch --index-url https://download.pytorch.org/whl/cpu pip install transformers==4.46.2 sentence-transformers==2.7.0 faiss-cpu langchain==1.0.5 \ @@ -33,17 +28,19 @@ pip install transformers==4.46.2 sentence-transformers==2.7.0 faiss-cpu langchai pypdf tqdm numpy ``` -**Why these packages?** -These libraries provide the essential building blocks of the RAG system: -- **sentence-transformers** — used for text embedding with the E5-base-v2 model. -- **faiss-cpu** — enables efficient similarity search for document retrieval. Since this pipeline runs on the Grace CPU, the CPU version of FAISS is sufficient — GPU acceleration is not required for this stage. -- **LangChain** — manages data orchestration between embedding, retrieval, and generation. -- **huggingface_hub** — handles model download and authentication. -- **pypdf** — extracts and processes text content from documents. -- **tqdm** — provide progress visualization. +These packages provide the essential building blocks of the RAG system: +- `sentence-transformers` is used for text embedding with the e5-base-v2 model. +- `faiss-cpu` enables efficient similarity search for document retrieval. 
+- `langchain` manages data orchestration between embedding, retrieval, and generation. +- `huggingface_hub` is used for model download and authentication. +- `pypdf` extracts and processes text content from documents. +- `tqdm` provides progress visualization. + +Since the pipeline runs on the Grace CPU, the CPU version of FAISS is sufficient and GPU acceleration is not required. + +Check the installation by printing the FAISS version: -Check installation: ```bash python - <<'EOF' import faiss, transformers @@ -52,21 +49,26 @@ print("FAISS GPU:", faiss.get_num_gpus() > 0) EOF ``` -The output confirms that FAISS is running in CPU mode (FAISS GPU: False), which is expected for this setup. -``` -FAISS version: 1.12.0 +The output confirms that FAISS is running in CPU mode. + +```output +FAISS version: 1.13.0 FAISS GPU: False ``` -## Step 2 – Model Preparation +## Model preparation + +Download and organize the models required for the RAG pipeline. -Download and organize the models required for the **GB10 Local RAG Pipeline**: +The two models are: -- **LLM (Large Language Model)** — llama-3-8b-instruct for text generation. -- **Embedding Model** — E5-base-v2 for document vectorization. +- The Large Language Model (LLM) is Llama 3.1 8B Instruct for text generation. +- The Embedding Model is e5-base-v2 for document vectorization. Both models will be stored locally under the `~/models` directory for offline operation. +You will need a Hugging Face token to get the embedding model. The instructions will be printed when you run `hf auth login` providing a link to generate a token. + ```bash mkdir -p ~/models && cd ~/models @@ -74,13 +76,15 @@ mkdir -p ~/models && cd ~/models hf auth login hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2 -# Download GGUF version of llama-3.1 8B model to save the time for local conversion +# Download GGUF version of Llama 3.1 8B model to save the time for local conversion wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf ``` -### Verify the **E5-base-v2** model +### Verify the e5-base-v2 model -Run a Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings. +Run a Python script to verify that the e5-base-v2 model loads correctly and can generate embeddings. + +Save the code below in a text file named `vector-test.py`. ```bash from sentence_transformers import SentenceTransformer @@ -108,31 +112,37 @@ except Exception as e: print(f" Model failed to load or generate embeddings: {e}") ``` -Expected output should confirm the E5-base-v2 model can generate embeddings successfully.” + +Run the code with: + +```bash +python ./vector-test.py ``` + +The output confirms the e5-base-v2 model can generate embeddings successfully. + +```output Model loaded and embeddings generated successfully. Embedding shape: (2, 768) First vector snippet: [-0.012 -0.0062 -0.0008 -0.0014 0.026 -0.0066 -0.0173 0.026 -0.0238 -0.0455] ``` -Interpret the E5-base-v2 Result: +The e5-base-v2 results show: -- ***Test sentences***: The two example sentences are used to confirm that the model can process text input and generate embeddings correctly. If this step succeeds, it means the model’s tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly. -- ***Embedding shape (2, 768)***: The two sentences were converted into two 768-dimensional embedding vectors — 768 is the hidden dimension size of this model. 
-- ***First vector snippet***: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text. +- Test sentences: The two example sentences are used to confirm that the model can process text input and generate embeddings correctly. If this step succeeds, the model's tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly. +- Embedding shape (2, 768): The two sentences were converted into two 768-dimensional embedding vectors. 768 is the hidden dimension size of this model. +- First vector snippet: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text. -A successful output confirms that the ***E5-base-v2 embedding model*** is functional and ready for use on the Grace CPU. +A successful output confirms that the e5-base-v2 embedding model is functional and ready for use. +### Verify the Llama 3.1 model -### Verify the **llama-3.1-8B** model +The llama.cpp runtime will be used for text generation using the Llama 3.1 model. -Then, you are going to verify the gguf model. +Ensure that both the CPU and the GPU builds of llama.cpp have been installed. You can find the instructions in [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/). -The **llama.cpp** runtime will be used for text generation. -Please ensure that both the **CPU** and **GPU** builds have been installed following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/). - -Perform a quick verification test on `llama-3.1-8B-Q8_0.gguf` +Verify the `Llama-3.1-8B-Q8_0.gguf` model is working using llama.cpp: ```bash cd ~/llama.cpp/build-gpu @@ -145,11 +155,11 @@ cd ~/llama.cpp/build-gpu You should see the model load successfully and print a short generated sentence, for example: -``` +```output Hello from this end! What brings you to this chat? Do you have any questions or topics you'd like to discuss? I'm here to help! ``` -Then, you need to check ***REST Server*** which you will need to use in RAG pipeline in next session. +Next, check the REST Server, which is needed for the RAG pipeline: ```bash ./bin/llama-server \ @@ -159,7 +169,7 @@ Then, you need to check ***REST Server*** which you will need to use in RAG pipe --host 0.0.0.0 ``` -Use another terminal in the same machine to do the health checking: +Use another terminal on the same machine to do the health check: ```bash curl http://127.0.0.1:8000/completion \ @@ -167,323 +177,12 @@ curl http://127.0.0.1:8000/completion \ -d '{"prompt": "Explain why unified memory improves CPU–GPU collaboration.", "n_predict": 64}' ``` -A short JSON payload containing a coherent explanation generated by the model. +You should see a short JSON payload containing a coherent explanation generated by the model. + +Terminate the `llama-server` using Ctrl-C. {{% notice Note %}} -To test remote access from another machine, replace `127.0.0.1` with the GB10 IP address. +To test remote access from another machine, replace `127.0.0.1` with the IP address of the machine running `llama-server`. {{% /notice %}} - -## Step 3 – Prepare a Sample Document Corpus - -Prepare the text corpus that your **RAG system** will use for retrieval and reasoning. -This stage converts your raw knowledge documents into clean, chunked text segments that can later be **vectorized and indexed** by FAISS. 
- -### Create a workspace and data folder -We’ll use a consistent directory layout so later scripts can find your data easily. - -```bash -mkdir -p ~/rag && cd ~/rag -mkdir pdf text -``` - -List all the source PDF URLs into a file, one per line. -In this learning path, we collect all of Raspberry Pi datasheet links into file called `datasheet.txt` - -``` -https://datasheets.raspberrypi.com/cm/cm1-and-cm3-datasheet.pdf -https://datasheets.raspberrypi.com/cm/cm3-plus-datasheet.pdf -https://datasheets.raspberrypi.com/cm4/cm4-datasheet.pdf -https://datasheets.raspberrypi.com/cm4io/cm4io-datasheet.pdf -https://datasheets.raspberrypi.com/cm4s/cm4s-datasheet.pdf -https://datasheets.raspberrypi.com/pico/pico-2-datasheet.pdf -https://datasheets.raspberrypi.com/pico/pico-datasheet.pdf -https://datasheets.raspberrypi.com/picow/pico-2-w-datasheet.pdf -https://datasheets.raspberrypi.com/picow/pico-w-datasheet.pdf -https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf -https://datasheets.raspberrypi.com/rp2350/rp2350-datasheet.pdf -https://datasheets.raspberrypi.com/rpi4/raspberry-pi-4-datasheet.pdf -``` - -Use `wget` to batch download all of pdf into `~/rag/pdf` -```bash -wget -P ~/rag/pdf -i datasheet.txt -``` - -### Convert PDF into txt file - -Then, create a python file `pdf2text.py` - -```python -from pypdf import PdfReader -import glob, os - -pdf_root = os.path.expanduser("~/rag/pdf") -txt_root = os.path.expanduser("~/rag/text") -os.makedirs(txt_root, exist_ok=True) - -count = 0 -for file in glob.glob(os.path.join(pdf_root, "**/*.pdf"), recursive=True): - print(f"File processing {file}") - try: - reader = PdfReader(file) - text = "\n".join(page.extract_text() or "" for page in reader.pages) - - rel_path = os.path.relpath(file, pdf_root) - txt_path = os.path.join(txt_root, os.path.splitext(rel_path)[0] + ".txt") - os.makedirs(os.path.dirname(txt_path), exist_ok=True) - - with open(txt_path, "w", encoding="utf-8") as f: - f.write(text) - - count += 1 - print(f"Converted: {file} -> {txt_path}") - - except Exception as e: - print(f"Error processing {file}: {e}") - -print(f"\nTotal converted PDFs: {count}") -print(f"Output directory: {txt_root}") -``` - -The resulting text files will form the base corpus for semantic retrieval in later steps. - -Run the Python script to convert all PDFs into text files. - -```bash -python pdf2text.py -``` - -This script converts all PDFs into text files for later embedding. - -### Verify your corpus -You should now see something like this in your folder: -```bash -find ~/rag/text/ -type f -name "*.txt" -exec cat {} + | wc -l -``` - -It will show how many line in total. - - -## Step 4 – Build an Embedding and Search Index - -Convert your prepared text corpus into **vector embeddings** and store them in a **FAISS index** for efficient semantic search. - -This stage enables your RAG pipeline to retrieve the most relevant text chunks when users ask questions. - -| **Component** | **Role** | -|--------------|------------------------------| -| **SentenceTransformer (E5-base-v2)** | Generates vector embeddings for each text chunk | -| **LangChain + FAISS** | Stores and searches embeddings efficiently | -| **RecursiveCharacterTextSplitter** | Splits long documents into manageable text chunks | - -Use **E5-base-v2** to encode the documents and create a FAISS vector index. 
- -### Create the FAISS builder script - -Save the following as `build_index.py` in `~/rag` - -```bash -mkdir -p ~/rag/faiss_index -``` - -The embedding process (about 10 minutes on CPU) will batch every 100 text chunks for progress logging. - -```python -import os, glob -from tqdm import tqdm - -from langchain_huggingface import HuggingFaceEmbeddings -from langchain_community.vectorstores import FAISS -from langchain_core.documents import Document -from langchain_text_splitters import RecursiveCharacterTextSplitter - -# Paths -data_dir = os.path.expanduser("~/rag/text") -model_dir = os.path.expanduser("~/models/e5-base-v2") -index_dir = os.path.expanduser("~/rag/faiss_index") - -os.makedirs(index_dir, exist_ok=True) - -# Load embedding model (CPU only) -embedder = HuggingFaceEmbeddings( - model_name=model_dir, - model_kwargs={"device": "cpu"} -) - -print(f" Embedder loaded on: {embedder._client.device}") -print(f" Model path: {model_dir}") - -# Collect and split all text files (recursive) -docs = [] -splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100) - -print("\n Scanning and splitting text files...") -for path in glob.glob(os.path.join(data_dir, "**/*.txt"), recursive=True): - with open(path, "r", encoding="utf-8", errors="ignore") as f: - text = f.read() - if not text.strip(): - continue - rel_path = os.path.relpath(path, data_dir) - for chunk in splitter.split_text(text): - docs.append(Document(page_content=chunk, metadata={"source": rel_path})) - -print(f" Total chunks loaded: {len(docs)}") - -# Prepare inputs for embedding -texts = [d.page_content for d in docs] -metadatas = [d.metadata for d in docs] - -""" -# Full embedding with progress logging every 100 chunks -print("\n Embedding text chunks (batch log every 100)...") -embeddings = [] -for i, chunk in enumerate(texts): - embedding = embedder.embed_documents([chunk])[0] - embeddings.append(embedding) - if (i + 1) % 100 == 0 or (i + 1) == len(texts): - print(f" Embedded {i + 1} / {len(texts)} chunks") -""" -# Batch embedding -embeddings = [] -batch_size = 16 -for i in range(0, len(texts), batch_size): - batch_texts = texts[i:i+batch_size] - batch_embeddings = embedder.embed_documents(batch_texts) - embeddings.extend(batch_embeddings) - print(f" Embedded {i + len(batch_texts)} / {len(texts)}") - -# Pair (text, embedding) for FAISS -text_embeddings = list(zip(texts, embeddings)) - -print("\n Saving FAISS index...") -db = FAISS.from_embeddings( - text_embeddings, - embedder, - metadatas=metadatas -) -db.save_local(index_dir) -print(f"\n FAISS index saved to: {index_dir}") -``` - - -**Run it:** -```bash -python build_index.py -``` - -The script will process the corpus, load approximately 6,000 text chunks, and save the resulting FAISS index to: `~/rag/faiss_index` - -You will find two of files inside. -- ***index.faiss*** - - A binary file that stores the vector index built using ***FAISS***. - - It contains the actual embeddings and data structures used for ***efficient similarity search*** (e.g., L2 distance, cosine). - - This file enables fast retrieval of nearest neighbors for any given query vector. -- ***index.pkl*** - - A ***Pickle*** file that stores metadata and original document chunks. - - It maps each vector in index.faiss back to its ***text content and source info*** (e.g., file name). - - Used by LangChain to return human-readable results along with context. - -You can verify the FAISS index using the following script. 
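+
+As a final check, you can optionally confirm that both models are stored locally where the later steps expect them. The paths below match the download commands used earlier in this section:
+
+```bash
+ls -lh ~/models/e5-base-v2 ~/models/Llama-3.1-8B-gguf
+```
+
+You should see the e5-base-v2 model files and a single `Meta-Llama-3.1-8B-Instruct-Q8_0.gguf` file of roughly 8-9 GB.
+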
- -```python -import os -from langchain_community.vectorstores import FAISS -from langchain_huggingface import HuggingFaceEmbeddings -from langchain_core.documents import Document - -model_path = os.path.expanduser("~/models/e5-base-v2") -index_path = os.path.expanduser("~/rag/faiss_index") - -embedder = HuggingFaceEmbeddings(model_name=model_path) -db = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True) - -query = "raspberry pi 4 power supply" -results = db.similarity_search(query, k=3) - -for i, r in enumerate(results, 1): - print(f"\nResult {i}") - print(f"Source: {r.metadata.get('source')}") - print(r.page_content[:300], "...") - -query = "Use SWD debug Raspberry Pi Pico" -results = db.similarity_search(query, k=3) - -for i, r in enumerate(results, 4): - print(f"\nResult {i}") - print(f"Source: {r.metadata.get('source')}") - print(r.page_content[:300], "...") -``` - -The results will look like the following: - -``` -Result 1 -Source: cm4io-datasheet.txt -Raspberry Pi Compute Module 4 IO Board. We recommend budgeting 9W for CM4. -If you want to supply an external +5V supply to the board, e.g. via J20 or via PoE J9, then we recommend that L5 be -removed. Removing L5 will prevent the on-board +5V and +3.3V supplies from starting up and +5V coming out of ... - -Result 2 -Source: cm4io-datasheet.txt -power the CM4. There is also an on-board +12V to +3.3V DC-DC converter PSU which is only used for the PCIe slot. The -+12V input feeds the +12V PCIe slot, the external PSU connector and the fan connector directly. If these aren’t being -used then a wider input supply is possible (+7.5V to +28V). -With ... - -Result 3 -Source: cm4io-datasheet.txt -that Raspberry Pi 4 Model B has, and for general usage you should refer to the Raspberry Pi 4 Model B documentation . -The significant difference between CM4IO and Raspberry Pi 4 Model B is the addition of a single PCIe socket. The -CM4IO has been designed as both a reference design for CM4 or to be u ... - -Result 4 -Source: pico-datasheet.txt -mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code -without any button presses. The SWD port can also be used to interactively debug code running on the RP2040. -Raspberry Pi Pico Datasheet -Chapter 1. About Raspberry Pi Pico 4 -Getting started ... - -Result 5 -Source: pico-2-datasheet.txt -mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code -without any button presses. The SWD port can also be used to interactively debug code running on the RP2350. - TIP -Getting started with Raspberry Pi Pico-series walks through loading progra ... - -Result 6 -Source: pico-w-datasheet.txt -without any button presses. The SWD port can also be used to interactively debug code running on the RP2040. -Getting started with Pico W -The Getting started with Raspberry Pi Pico-series book walks through loading programs onto the -board, and shows how to install the C/C++ SDK and build the example ... -``` - -The execution of `check_index.py` confirmed that your local ***FAISS vector index*** is functioning correctly for semantic search tasks. - -You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: ***Raspberry Pi 4 power supply*** and ***Raspberry Pi Pico SWD debugging***. - -- For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. 
These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts. - -- For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`. -The extracted passages consistently explained how the ***Serial Wire Debug (SWD)*** port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document. - -This process validates that your system can perform semantic retrieval on technical documents — a core capability of any RAG application. - -In summary, both semantic queries were successfully answered using your local vector store, validating that the indexing, embedding, metadata, and retrieval components of your RAG backend are working correctly in a CPU-only configuration. - - -| **Stage** | **Technology** | **Hardware Execution** | **Function** | -|------------|----------------|------------------------|---------------| -| Document Processing | pypdf, python-docx | Grace CPU | Text extraction | -| Embedding | E5-base-v2 (sentence-transformers) | Grace CPU | Vectorization | -| Retrieval | FAISS + LangChain | Grace CPU | Semantic search | -| Generation | llama.cpp REST Server | Blackwell GPU + Grace CPU | Text generation | -| Orchestration | Python RAG Script | Grace CPU | Pipeline control | -| Unified Memory | NVLink-C2C | Shared | Zero-copy data exchange | - -At this point, your environment is fully configured and validated. -You have confirmed that the E5-base-v2 embedding model, FAISS index, and Llama 3.1 8B model are all functioning correctly. - -In the next module, you will integrate all these validated components into a full **Retrieval-Augmented Generation (RAG)** pipeline, combining CPU-based retrieval and GPU-accelerated generation on the ***Grace–Blackwell (GB10)*** platform. \ No newline at end of file +With the development setup, tools, and models prepared, you can create the vector database and add your documents. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md new file mode 100644 index 000000000..46e280de2 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md @@ -0,0 +1,355 @@ +--- +title: Add documents to the vector database +weight: 4 +layout: "learningpathall" +--- + +## Prepare a sample document corpus + +You are now ready to add your documents to the RAG database that will be used for retrieval and reasoning. + +This converts your raw knowledge documents into clean, chunked text segments that can later be vectorized and indexed by FAISS. + +## Understanding FAISS for vector search + +FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search and clustering of dense vectors. 
It's particularly well-suited for RAG applications because it can quickly find the most relevant document chunks from large collections. + +Key advantages of FAISS for this application: + +- CPU efficiency: FAISS is highly optimized for Arm CPUs, making it ideal for the Grace CPU in the GB10 platform +- Scalability: Handles millions of vectors with minimal memory overhead +- Speed: Uses advanced indexing algorithms to perform nearest-neighbor searches in milliseconds +- Flexibility: Supports multiple distance metrics (L2, cosine similarity) and index types + +### Create a workspace and data folder + +Create a directory structure for your data: + +```bash +mkdir -p ~/rag && cd ~/rag +mkdir pdf text +``` + +You can add any PDF data sources to your RAG database. + +For illustration, you can add a number of Raspberry Pi documents that you want to use to find out specific information about the Raspberry Pi products. + +Use a text editor to create a file named `datasheet.txt` listing all data source URLs that will be used for the RAG data. Make sure to include one URL per line. + +```console +https://datasheets.raspberrypi.com/cm/cm1-and-cm3-datasheet.pdf +https://datasheets.raspberrypi.com/cm/cm3-plus-datasheet.pdf +https://datasheets.raspberrypi.com/cm4/cm4-datasheet.pdf +https://datasheets.raspberrypi.com/cm4io/cm4io-datasheet.pdf +https://datasheets.raspberrypi.com/cm4s/cm4s-datasheet.pdf +https://datasheets.raspberrypi.com/pico/pico-2-datasheet.pdf +https://datasheets.raspberrypi.com/pico/pico-datasheet.pdf +https://datasheets.raspberrypi.com/picow/pico-2-w-datasheet.pdf +https://datasheets.raspberrypi.com/picow/pico-w-datasheet.pdf +https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf +https://datasheets.raspberrypi.com/rp2350/rp2350-datasheet.pdf +https://datasheets.raspberrypi.com/rpi4/raspberry-pi-4-datasheet.pdf +``` + +Use `wget` to batch download all the PDFs into `~/rag/pdf`. + +```bash +wget -P ~/rag/pdf -i datasheet.txt +``` + +### Convert PDF into txt file + +Then, create a Python file named `pdf2text.py` with the code below: + +```python +from pypdf import PdfReader +import glob, os + +pdf_root = os.path.expanduser("~/rag/pdf") +txt_root = os.path.expanduser("~/rag/text") +os.makedirs(txt_root, exist_ok=True) + +count = 0 +for file in glob.glob(os.path.join(pdf_root, "**/*.pdf"), recursive=True): + print(f"File processing {file}") + try: + reader = PdfReader(file) + text = "\n".join(page.extract_text() or "" for page in reader.pages) + + rel_path = os.path.relpath(file, pdf_root) + txt_path = os.path.join(txt_root, os.path.splitext(rel_path)[0] + ".txt") + os.makedirs(os.path.dirname(txt_path), exist_ok=True) + + with open(txt_path, "w", encoding="utf-8") as f: + f.write(text) + + count += 1 + print(f"Converted: {file} -> {txt_path}") + + except Exception as e: + print(f"Error processing {file}: {e}") + +print(f"\nTotal converted PDFs: {count}") +print(f"Output directory: {txt_root}") +``` + +The resulting text files will form the corpus for semantic retrieval. + +Run the Python script to convert all PDFs into text files. + +```bash +python pdf2text.py +``` + +This script converts all PDFs into text files for later embedding. + +At the end of the output you see: + +```output +Total converted PDFs: 12 +``` + +### Verify your corpus + +You should now see a number of files in your folder. Run the command below to inspect the results: + +```bash +find ~/rag/text/ -type f -name "*.txt" -exec cat {} + | wc -l +``` + +It will show how many lines are in total. 
The number is around 100,000. + +## Build an Embedding and Search Index + +Convert your prepared text corpus into vector embeddings and store them in a FAISS index for efficient semantic search. + +This stage enables your RAG pipeline to retrieve the most relevant text chunks when you ask questions. + +| **Component** | **Role** | +|--------------|------------------------------| +| SentenceTransformer (e5-base-v2) | Generates vector embeddings for each text chunk | +| LangChain and FAISS | Stores and searches embeddings efficiently | +| RecursiveCharacterTextSplitter | Splits long documents into manageable text chunks | + +Use e5-base-v2 to encode the documents and create a FAISS vector index. + +### Create the FAISS builder script + + +```bash +mkdir -p ~/rag/faiss_index +``` + +Create a file named `build_index.py` in `~/rag` that will perform the embedding. + +The embedding process (about 10 minutes on CPU) will batch every 100 text chunks for progress logging. + +```python +import os, glob +from tqdm import tqdm + +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_community.vectorstores import FAISS +from langchain_core.documents import Document +from langchain_text_splitters import RecursiveCharacterTextSplitter + +# Paths +data_dir = os.path.expanduser("~/rag/text") +model_dir = os.path.expanduser("~/models/e5-base-v2") +index_dir = os.path.expanduser("~/rag/faiss_index") + +os.makedirs(index_dir, exist_ok=True) + +# Load embedding model (CPU only) +embedder = HuggingFaceEmbeddings( + model_name=model_dir, + model_kwargs={"device": "cpu"} +) + +print(f" Embedder loaded on: {embedder._client.device}") +print(f" Model path: {model_dir}") + +# Collect and split all text files (recursive) +docs = [] +splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100) + +print("\n Scanning and splitting text files...") +for path in glob.glob(os.path.join(data_dir, "**/*.txt"), recursive=True): + with open(path, "r", encoding="utf-8", errors="ignore") as f: + text = f.read() + if not text.strip(): + continue + rel_path = os.path.relpath(path, data_dir) + for chunk in splitter.split_text(text): + docs.append(Document(page_content=chunk, metadata={"source": rel_path})) + +print(f" Total chunks loaded: {len(docs)}") + +# Prepare inputs for embedding +texts = [d.page_content for d in docs] +metadatas = [d.metadata for d in docs] + +""" +# Full embedding with progress logging every 100 chunks +print("\n Embedding text chunks (batch log every 100)...") +embeddings = [] +for i, chunk in enumerate(texts): + embedding = embedder.embed_documents([chunk])[0] + embeddings.append(embedding) + if (i + 1) % 100 == 0 or (i + 1) == len(texts): + print(f" Embedded {i + 1} / {len(texts)} chunks") +""" +# Batch embedding +embeddings = [] +batch_size = 16 +for i in range(0, len(texts), batch_size): + batch_texts = texts[i:i+batch_size] + batch_embeddings = embedder.embed_documents(batch_texts) + embeddings.extend(batch_embeddings) + print(f" Embedded {i + len(batch_texts)} / {len(texts)}") + +# Pair (text, embedding) for FAISS +text_embeddings = list(zip(texts, embeddings)) + +print("\n Saving FAISS index...") +db = FAISS.from_embeddings( + text_embeddings, + embedder, + metadatas=metadatas +) +db.save_local(index_dir) +print(f"\n FAISS index saved to: {index_dir}") +``` + +Run the code to generate the embeddings: + +```bash +python build_index.py +``` + +The script will process the corpus, load approximately 6,000 text chunks, and save the resulting FAISS index to the 
`~/rag/faiss_index` directory. + +You will find two files inside. + +- ***index.faiss*** + - A binary file that stores the vector index built using FAISS. + - It contains the actual embeddings and data structures used for efficient similarity search. + - This file enables fast retrieval of nearest neighbors for any given query vector. + +- ***index.pkl*** + - A pickle file that stores metadata and original document chunks. + - It maps each vector in `index.faiss` back to its text content and source info, including file name. + - Used by LangChain to return human-readable results along with context. + +You can verify the FAISS index using the following script. + +Save the code below in `check_index.py`. + +```python +import os +from langchain_community.vectorstores import FAISS +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_core.documents import Document + +model_path = os.path.expanduser("~/models/e5-base-v2") +index_path = os.path.expanduser("~/rag/faiss_index") + +embedder = HuggingFaceEmbeddings(model_name=model_path) +db = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True) + +query = "raspberry pi 4 power supply" +results = db.similarity_search(query, k=3) + +for i, r in enumerate(results, 1): + print(f"\nResult {i}") + print(f"Source: {r.metadata.get('source')}") + print(r.page_content[:300], "...") + +query = "Use SWD debug Raspberry Pi Pico" +results = db.similarity_search(query, k=3) + +for i, r in enumerate(results, 4): + print(f"\nResult {i}") + print(f"Source: {r.metadata.get('source')}") + print(r.page_content[:300], "...") +``` + +Run the code using: + +```bash +python check_index.py +``` + +The results will look like the following: + +```output +Result 1 +Source: cm4io-datasheet.txt +Raspberry Pi Compute Module 4 IO Board. We recommend budgeting 9W for CM4. +If you want to supply an external +5V supply to the board, e.g. via J20 or via PoE J9, then we recommend that L5 be +removed. Removing L5 will prevent the on-board +5V and +3.3V supplies from starting up and +5V coming out of ... + +Result 2 +Source: cm4io-datasheet.txt +power the CM4. There is also an on-board +12V to +3.3V DC-DC converter PSU which is only used for the PCIe slot. The ++12V input feeds the +12V PCIe slot, the external PSU connector and the fan connector directly. If these aren’t being +used then a wider input supply is possible (+7.5V to +28V). +With ... + +Result 3 +Source: cm4io-datasheet.txt +that Raspberry Pi 4 Model B has, and for general usage you should refer to the Raspberry Pi 4 Model B documentation . +The significant difference between CM4IO and Raspberry Pi 4 Model B is the addition of a single PCIe socket. The +CM4IO has been designed as both a reference design for CM4 or to be u ... + +Result 4 +Source: pico-datasheet.txt +mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code +without any button presses. The SWD port can also be used to interactively debug code running on the RP2040. +Raspberry Pi Pico Datasheet +Chapter 1. About Raspberry Pi Pico 4 +Getting started ... + +Result 5 +Source: pico-2-datasheet.txt +mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code +without any button presses. The SWD port can also be used to interactively debug code running on the RP2350. + TIP +Getting started with Raspberry Pi Pico-series walks through loading progra ... + +Result 6 +Source: pico-w-datasheet.txt +without any button presses. 
The SWD port can also be used to interactively debug code running on the RP2040. +Getting started with Pico W +The Getting started with Raspberry Pi Pico-series book walks through loading programs onto the +board, and shows how to install the C/C++ SDK and build the example ... +``` + +The execution of `check_index.py` confirms that your local FAISS vector index is functioning correctly for semantic search tasks. + +You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: "Raspberry Pi 4 power supply" and "Raspberry Pi Pico SWD debugging". + +- For the first query, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent and that the FAISS index correctly surfaced content even when specific keywords like "power supply" appeared in varied contexts. + +- For the second query, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`. +The extracted passages consistently explained how the Serial Wire Debug (SWD) port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document. + +This process validates that your system can perform semantic retrieval on technical documents, a core capability of any RAG application. + +In summary, both semantic queries were successfully answered using your local vector store, validating that the indexing, embedding, metadata, and retrieval components of your RAG backend are working correctly in a CPU-only configuration. + + +| **Stage** | **Technology** | **Hardware Execution** | **Function** | +|------------|----------------|------------------------|---------------| +| Document Processing | pypdf, python-docx | Grace CPU | Text extraction | +| Embedding | e5-base-v2 (sentence-transformers) | Grace CPU | Vectorization | +| Retrieval | FAISS + LangChain | Grace CPU | Semantic search | +| Generation | llama.cpp REST Server | Blackwell GPU + Grace CPU | Text generation | +| Orchestration | Python RAG Script | Grace CPU | Pipeline control | +| Unified Memory | NVLink-C2C | Shared | Zero-copy data exchange | + +At this point, your environment is fully configured and validated. +You have confirmed that the e5-base-v2 embedding model, FAISS index, and Llama 3.1 8B model are all functioning correctly. + +In the next section, you will integrate the validated components into a full Retrieval-Augmented Generation (RAG) pipeline, combining CPU-based retrieval and GPU-accelerated generation. 
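+
+Before you continue, you can optionally extend the index check to show how close each match is. The short sketch below reuses the same model and index paths as `check_index.py` and calls `similarity_search_with_score`, which returns each document together with an L2 distance, so lower values mean closer matches:
+
+```python
+import os
+from langchain_community.vectorstores import FAISS
+from langchain_huggingface import HuggingFaceEmbeddings
+
+model_path = os.path.expanduser("~/models/e5-base-v2")
+index_path = os.path.expanduser("~/rag/faiss_index")
+
+embedder = HuggingFaceEmbeddings(model_name=model_path)
+db = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True)
+
+# Each result is a (Document, score) pair; the score is a raw FAISS L2
+# distance, so smaller values indicate closer semantic matches
+for doc, score in db.similarity_search_with_score("raspberry pi 4 power supply", k=3):
+    print(f"{score:.4f}  {doc.metadata.get('source')}")
+```
+
+Save the code in a file with a name of your choice, for example `check_scores.py`, and run it inside the same virtual environment. If the top results sit at noticeably lower distances than the rest, the index is separating relevant chunks from unrelated ones as expected.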
+ diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md index e71d91f38..bfc27886a 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md @@ -1,24 +1,25 @@ --- -title: Implementing the RAG Pipeline -weight: 4 +title: Implementing the RAG pipeline +weight: 5 layout: "learningpathall" --- -## Integrating Retrieval and Generation +## Integrating retrieval and generation -In the previous modules, you prepared the environment, validated the ***E5-base-v2*** embedding model, and verified that the ***Llama 3.1 8B*** Instruct model runs successfully on the ***Grace–Blackwell (GB10)*** platform. +In the previous sections, you prepared the environment, validated the e5-base-v2 embedding model, and verified that the Llama 3.1 8B Instruct model runs successfully on the Grace–Blackwell (GB10) platform. -In this module, you will bring all components together to build a complete ***Retrieval-Augmented Generation*** (RAG) workflow. -This stage connects the ***CPU-based retrieval and indexing*** with ***GPU-accelerated language generation***, creating an end-to-end system capable of answering technical questions using real documentation data. +In this section, you will bring all components together to build a complete Retrieval-Augmented Generation (RAG) workflow. + +This stage connects the CPU-based retrieval and indexing with GPU-accelerated language generation, creating an end-to-end system capable of answering technical questions using real documentation data. Building upon the previous modules, you will now: -- Connect the **E5-base-v2** embedding model and FAISS vector index. -- Integrate the **llama.cpp** REST server for GPU-accelerated inference. -- Execute a complete **Retrieval-Augmented Generation** (RAG) workflow for end-to-end question answering. +- Connect the e5-base-v2 embedding model and FAISS vector index. +- Integrate the llama.cpp REST server for GPU-accelerated inference. +- Execute a complete Retrieval-Augmented Generation (RAG) workflow for end-to-end question answering. -### Step 1 – Start the llama.cpp REST Server +### Start the llama.cpp REST server -Before running the RAG query script, ensure the LLM server is active. +Before running the RAG query script, ensure the LLM server is active by running: ```bash cd ~/llama.cpp/build-gpu/ @@ -29,23 +30,24 @@ cd ~/llama.cpp/build-gpu/ ``` Verify the server status from another terminal: + ```bash curl http://127.0.0.1:8000/health ``` -Expected output: -``` +The output is: + +```output {"status":"ok"} ``` +### Create the RAG query script -### Step 2 – Create the RAG Query Script - -This script performs the full pipeline: +This script performs the full pipeline using the flow: -***query*** → ***embedding*** → ***retrieval*** → ***context assembly*** → ***generation*** +User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer -Save as rag_query_rest.py under ~/rag/. +Save the code below in a file named `rag_query_rest.py` in the `~/rag` directory. ```bash import os @@ -100,9 +102,14 @@ if __name__ == "__main__": print(answer) ``` -### Step 3 – Execute the RAG Query Script +Make sure you are in the Python virtual environment in each terminal. 
If needed, run: + +```bash +cd ~/rag +source rag-venv/bin/activate +``` -Then, run the python script to ask the question about ***How many CPU core inside the RaspberryPi 4?*** +Run the python script to ask the question, "How many CPU core inside the RaspberryPi 4?". ```bash python rag_query_rest.py @@ -110,7 +117,7 @@ python rag_query_rest.py You will receive an answer similar to the following. -``` +```output Retrieved sources: 1. cm4-datasheet.txt 2. raspberry-pi-4-datasheet.txt @@ -124,15 +131,20 @@ The Raspberry Pi 4 has 4 CPU cores. The retrieved context referenced three datasheets and produced the correct answer: "4". -Next, let’s ask a more Raspberry Pi 4 hardware-specific question like: what's the default pull setting of GPIO12?` +Try a different question. -Comment out the first question line `answer = rag_query("How many CPU core inside the RaspberryPi 4?")` -and uncomment the second one to test a more detailed query. -`answer = rag_query("On the Raspberry Pi 4, which GPIOs have a default pull-down (pull low) configuration? Please specify the source and the section of the datasheet where this information can be found.")` +Comment out the first question `answer = rag_query("How many CPU core inside the RaspberryPi 4?")` +and uncomment the second question to test a more detailed query. +Run the script again with the new question. -Modify the answer = rag_query("On raspbeery pi 4, what's the default pull of GPIO12?") +```bash +python rag_query_rest.py ``` + +The output is: + +```output Retrieved sources: 1. cm3-plus-datasheet.txt 2. raspberry-pi-4-datasheet.txt @@ -147,11 +159,6 @@ Step 3: Specifically, we are looking for the default pull state of GPIO12. We c Step 4: The table shows that GPIO12 has a default pull state of Low. Step 5: Therefore, the default pull of GPIO12 on a Raspberry Pi 4 is Low. -The final answer is: $\boxed{Low}$ -``` - - -``` Retrieved sources: 1. raspberry-pi-4-datasheet.txt 2. cm4-datasheet.txt @@ -178,22 +185,31 @@ This demonstrates that the RAG system correctly retrieved relevant sources and g You can reference the section 5.1.2 on the PDF to verify the result. -### Step 4 - CPU–GPU Utilization Observation +### Observe CPU and GPU utilization + +If you have installed `htop` and `nvtop`, you can observe CPU and GPU utilization. + +If you do not have them, run: + +```bash +sudo apt install -y nvtop htop +``` + +The screenshots below show `nvtop` on the left and `htop` on the right side. + +![image1 CPU–GPU Utilization screenshot](rag_utilization.jpeg) -Follow the previous (learning path) [https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/], you can also install `htop` and `nvtop` to observe CPU and GPU utilitization. +From the screenshots, you can see how the Grace CPU and the Blackwell GPU collaborate during RAG execution. -![image1 CPU–GPU Utilization screenshot](rag_utilization.jpeg "CPU–GPU Utilization") +On the left, the GPU utilization graph shows a clear spike reaching 96%, indicating that the llama.cpp inference engine is actively generating tokens on the GPU. -The figure above illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG** execution. -On the left, the GPU utilization graph shows a clear spike reaching ***96%***, indicating that the llama.cpp inference engine is actively generating tokens on the GPU. 
-Meanwhile, on the right, the htop panel shows multiple Python processes (rag_query_rest.py) running on a single Grace CPU core, maintaining around 93% per-core utilization. +Meanwhile, on the right, `htop` shows multiple Python processes running on the Grace CPU cores, maintaining around 93% per-core utilization. This demonstrates the hybrid execution model of the RAG pipeline: - The Grace CPU handles embedding computation, FAISS retrieval, and orchestration of REST API calls. - The Blackwell GPU performs heavy matrix multiplications for LLM token generation. -- Both operate concurrently within the same Unified Memory space, eliminating data copy overhead between CPU and GPU. +- Both operate concurrently within the same Unified Memory space, eliminating data copy overhead between the CPU and GPU. -You have now connected all components of the RAG pipeline on the ***Grace–Blackwell*** (GB10) platform. -The ***Grace CPU*** handled ***embedding*** and ***FAISS retrieval***, while the ***Blackwell GPU*** generated answers efficiently via llama.cpp REST Server. +You have now connected the components of the RAG pipeline on the GB10 platform. -With the RAG pipeline now complete, the next module will focus on Unified Memory behavior, you will observe Unified Memory behavior to understand how CPU and GPU share data seamlessly within the same memory space. +With the RAG pipeline now complete, the next section focuses on unified memory. You will learn how the CPU and GPU share data seamlessly within the same memory space. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md index fb2c79097..a92318991 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md @@ -1,28 +1,33 @@ --- -title: Observing Unified Memory Collaboration -weight: 5 +title: Observe unified memory performance +weight: 6 layout: "learningpathall" --- -## Observing Unified Memory Collaboration +## Observe unified memory performance -In this module, you will monitor how the ***Grace CPU*** and ***Blackwell GPU*** share data through Unified Memory during RAG execution. +In this section, you will observe how the Grace CPU and Blackwell GPU share data through unified memory during RAG execution. You will start from an idle system state, then progressively launch the model server and run a query, while monitoring both system memory and GPU activity from separate terminals. -Through these real-time observations, you will verify that the Grace–Blackwell Unified Memory architecture enables zero-copy data sharing — allowing both processors to access the same memory space without moving data. +Through these real-time observations, you will verify that the Grace–Blackwell unified memory architecture enables zero-copy data sharing, allowing both processors to access the same memory space without moving data. 
+Open two terminals on your GB10 system and use them as listed in the table below: | **Terminal** | **Observation Target** | **Purpose** | |----------------------|------------------------|----------------------------------------------------| -| `Monitor Terminal 1` | System memory usage | Observe memory allocation changes as processes run | -| `Monitor Terminal 2` | GPU activity | Track GPU utilization, power draw, and temperature | +| Monitor Terminal 1 | System memory usage | Observe memory allocation changes as processes run | +| Monitor Terminal 2 | GPU activity | Track GPU utilization, power draw, and temperature | -### Step 1 – Experiment Preparation +You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation. + +### Prepare for the experiments Ensure the RAG pipeline is stopped before starting the observation. -#### Monitor Terminal 1 - System Memory Observation +#### Terminal 1 - system memory observation + +Run the Bash commands below in terminal 1 to print the free memory of the system: ```bash while true; do @@ -32,36 +37,44 @@ while true; do done ``` -Example Output: -``` +The output is similar to the following: + +```output [2025-11-07 22:34:24] used=3.5Gi free=106Gi available=116Gi [2025-11-07 22:34:25] used=3.5Gi free=106Gi available=116Gi [2025-11-07 22:34:26] used=3.5Gi free=106Gi available=116Gi [2025-11-07 22:34:27] used=3.5Gi free=106Gi available=116Gi ``` -**Field Explanation:** +The printed fields are: + - `used` — Total memory currently utilized by all active processes. - `free` — Memory not currently allocated or reserved by the system. - `available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers. -#### Monitor Terminal 2 – GPU Status Observation +#### Terminal 2 – GPU status observation + +Run the Bash commands below in terminal 2 to print the GPU statistics: ```bash -sudo stdbuf -oL nvidia-smi --loop-ms=1000 \ +stdbuf -oL nvidia-smi --loop-ms=1000 \ --query-gpu=timestamp,utilization.gpu,utilization.memory,power.draw,temperature.gpu,memory.used \ --format=csv,noheader,nounits ``` -Example Output: <-- format not easy to read -``` +The output is similar to the following: + +```output 2025/11/07 22:38:05.114, 0, 0, 4.43, 36, [N/A] 2025/11/07 22:38:06.123, 0, 0, 4.46, 36, [N/A] 2025/11/07 22:38:07.124, 0, 0, 4.51, 36, [N/A] 2025/11/07 22:38:08.124, 0, 0, 4.51, 36, [N/A] ``` -**Field Output Explanation**: +The format is not easy to read, but following the date and time, there are three key stats being reported: utilization, power, and temperature. The memory-related stats are not used on the GB10 system. 
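+
+If you would rather see labeled values while the loop runs, you can pipe the `nvidia-smi` output through a short helper script. The sketch below is an optional convenience and is not required for the rest of this Learning Path; the file name `label_gpu_stats.py` and the field labels are illustrative choices, not part of the original setup.
+
+```python
+#!/usr/bin/env python3
+# label_gpu_stats.py - label the CSV stream produced by the nvidia-smi loop above.
+# The field order matches the --query-gpu list used in terminal 2.
+import sys
+
+FIELDS = ["timestamp", "gpu_util(%)", "mem_util(%)", "power(W)", "temp(C)", "mem_used"]
+
+for line in sys.stdin:
+    values = [v.strip() for v in line.strip().split(",")]
+    if len(values) != len(FIELDS):
+        continue  # skip blank or partial lines
+    print("  ".join(f"{name}={value}" for name, value in zip(FIELDS, values)))
+```
+
+For example, append `| python3 label_gpu_stats.py` to the `nvidia-smi` command to print one labeled line per sample.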
+ +Here is an explanation of the fields: + | **Field** | **Description** | **Interpretation** | |----------------------|---------------------------|-----------------------------------------------------------------------------| | `timestamp` | Time of data sampling | Used to align GPU metrics with memory log timestamps | @@ -72,9 +85,11 @@ Example Output: <-- format not easy to read | `memory.used` | GPU VRAM usage | GB10 does not include separate VRAM; all data resides within Unified Memory | -### Step 2 – Launch the llama-server +### Run the llama-server -Now, start the `llama.cpp` REST server again in your original terminal (the same flow of previous session) +With the idle condition understood, start the `llama.cpp` REST server again in your original terminal, not the two new terminals being used for observation. + +Here is the command: ```bash cd ~/llama.cpp/build-gpu/ @@ -86,8 +101,9 @@ cd ~/llama.cpp/build-gpu/ Observe both monitoring terminals: -Monitor Terminal 1 -``` +The output in monitor terminal 1 is similar to: + +```output [2025-11-07 22:50:27] used=3.5Gi free=106Gi available=116Gi [2025-11-07 22:50:28] used=3.9Gi free=106Gi available=115Gi [2025-11-07 22:50:29] used=11Gi free=98Gi available=108Gi @@ -97,8 +113,9 @@ Monitor Terminal 1 [2025-11-07 22:50:33] used=12Gi free=97Gi available=106Gi ``` -Monitor Terminal 2 -``` +The output in monitor terminal 2 is similar to: + +```output 2025/11/07 22:50:27.836, 0, 0, 4.39, 35, [N/A] 2025/11/07 22:50:28.836, 0, 0, 6.75, 36, [N/A] 2025/11/07 22:50:29.837, 6, 0, 11.47, 36, [N/A] @@ -110,23 +127,24 @@ Monitor Terminal 2 | **Terminal** | **Observation** | **Behavior** | |--------------------|------------------------------------------------------|-------------------------------------------------| -| Monitor Terminal 1 | used increases by ~8 GiB | Model weights loaded into shared Unified Memory | -| Monitor Terminal 2 | utilization.gpu momentarily spikes, power.draw rises | GPU initialization and model mapping | +| Monitor Terminal 1 | used increases by about 8 GiB | Model weights loaded into shared Unified Memory | +| Monitor Terminal 2 | GPU utilization momentarily spikes and power rises | GPU initialization and model mapping | -This confirms the model is resident in Unified Memory — visible by increased system RAM, but not as GPU VRAM usage. +This confirms the model is resident in unified memory, which is visible by the increased system RAM usage. -## Step 3 – Execute the RAG Query +## Execute the RAG Query -In another terminal (or background session), run: +With the observation code and the `llama-server` still running, run the RAG query in another terminal: ```bash python3 ~/rag/rag_query_rest.py ``` -Monitor Terminal 1 -``` +The output in monitor terminal 1 is similar to: + +```output [2025-11-07 22:53:56] used=12Gi free=97Gi available=106Gi [2025-11-07 22:53:57] used=12Gi free=97Gi available=106Gi [2025-11-07 22:53:58] used=12Gi free=97Gi available=106Gi @@ -145,8 +163,9 @@ Monitor Terminal 1 [2025-11-07 22:54:11] used=12Gi free=97Gi available=106Gi ``` -Monitor Terminal 2 -``` +The output in monitor terminal 2 is similar to: + +```output 2025/11/07 22:53:56.010, 0, 0, 11.24, 41, [N/A] 2025/11/07 22:53:57.010, 0, 0, 11.22, 41, [N/A] 2025/11/07 22:53:58.011, 0, 0, 11.20, 41, [N/A] @@ -173,37 +192,34 @@ Monitor Terminal 2 | 22:54:10 | 0% | 12 W | 12 Gi | Query completed, temporary buffers released | -The GPU executes compute kernels (utilization.gpu ≈ 96%) without reading from GDDR or PCIe. 
+The GPU executes compute kernels at about 96% utilization without reading from GDDR or PCIe.
 
-Hence, `utilization.memory=0` and `memory.used=[N/A]` are the clearest signs that data sharing, not data copying, is happening.
+The `utilization.memory=0` and `memory.used=[N/A]` metrics are clear signs that data sharing, not data copying, is happening.
 
-### Observe and Interpret Unified Memory Behavior:
+### Observe and interpret unified memory behavior
 
 This experiment confirms the Grace–Blackwell Unified Memory architecture in action:
 
-- CPU and GPU share the same address space.
+- The CPU and GPU share the same address space.
 - No data transfers occur via PCIe.
 - Memory activity remains stable while GPU utilization spikes.
 
-Data doesn’t move — computation moves to the data.
+Data does not move; computation moves to the data.
 
-The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation,
-both operating within the same Unified Memory pool.
+The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, both operating within the same Unified Memory pool.
 
-### Summary of Unified Memory Behavior
+### Summary of unified memory behavior
 
 | **Observation**                                     | **Unified Memory Explanation**                             |
 |----------------------------------------------------|----------------------------------------------------------|
 | Memory increases once (during model loading)        | Model weights are stored in shared Unified Memory          |
 | Slight memory increase during query execution       | CPU temporarily stores context; GPU accesses it directly   |
 | GPU power increases during computation              | GPU cores are actively performing inference                |
-| No duplicated allocation or data transfer observed | Data is successfully shared between CPU and GPU            |
-
+| No duplicated allocation or data transfer observed | Data is successfully shared between the CPU and GPU |
 
-In this learning path, you have successfully implemented a ***Retrieval-Augmented Generation*** (RAG) pipeline on the ***Grace–Blackwell*** (GB10) platform and observed how the ***Grace CPU*** and ***Blackwell GPU*** operate together within the same ***Unified Memory*** space — sharing data seamlessly, without duplication or explicit data movement.
-Through this hands-on experiment, you confirmed that:
+Through this experiment, you confirmed that:
 - The Grace CPU efficiently handles retrieval, embedding, and orchestration tasks.
 - The Blackwell GPU accelerates generation using data directly from Unified Memory.
 - The system memory and GPU activity clearly demonstrate zero-copy data sharing.
 
-This exercise highlights how the Grace–Blackwell architecture simplifies hybrid AI development — enabling data to stay in place while computation moves to it, reducing complexity and improving efficiency for next-generation Arm-based AI systems.
\ No newline at end of file
+This exercise highlights how the Grace–Blackwell architecture simplifies hybrid AI development by reducing complexity and improving efficiency for next-generation Arm-based AI systems.
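+
+As an optional follow-up experiment, you can capture the two monitoring loops to files and line their samples up by timestamp. The sketch below assumes you saved the output of terminal 1 and terminal 2 to `mem.log` and `gpu.log`, for example by appending `| tee mem.log` and `| tee gpu.log` to the two loops; the file names and the per-second alignment approach are illustrative choices, not part of the original steps.
+
+```python
+#!/usr/bin/env python3
+# align_logs.py - join the memory log and the GPU log on their per-second timestamps.
+from datetime import datetime
+
+def load_mem(path):
+    entries = {}
+    with open(path) as f:
+        for line in f:
+            # Example: [2025-11-07 22:53:56] used=12Gi free=97Gi available=106Gi
+            if not line.startswith("["):
+                continue
+            stamp, rest = line[1:].split("]", 1)
+            entries[datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")] = rest.strip()
+    return entries
+
+def load_gpu(path):
+    entries = {}
+    with open(path) as f:
+        for line in f:
+            # Example: 2025/11/07 22:53:56.010, 0, 0, 11.24, 41, [N/A]
+            parts = [p.strip() for p in line.split(",")]
+            if len(parts) != 6:
+                continue
+            key = datetime.strptime(parts[0].split(".")[0], "%Y/%m/%d %H:%M:%S")
+            entries[key] = f"gpu={parts[1]}%  power={parts[3]}W  temp={parts[4]}C"
+    return entries
+
+mem = load_mem("mem.log")
+gpu = load_gpu("gpu.log")
+for ts in sorted(mem.keys() & gpu.keys()):
+    print(ts, "|", mem[ts], "|", gpu[ts])
+```
+
+Running the script after an experiment prints one line per second that pairs the memory figures with the GPU utilization, power, and temperature, which makes it easier to see that system memory stays flat while GPU activity rises.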
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md index 498f86e1a..36c5addd2 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md @@ -1,5 +1,5 @@ --- -title: End-to-End RAG Pipeline on Grace–Blackwell (GB10) +title: Build a RAG pipeline on NVIDIA DGX Spark draft: true cascade: @@ -7,7 +7,7 @@ cascade: minutes_to_complete: 60 -who_is_this_for: This learning path is designed for developers and engineers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline optimized for the Grace–Blackwell (GB10) platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST Server. By the end, learners will understand how to build an efficient hybrid CPU–GPU RAG system that leverages Unified Memory for seamless data sharing between computation layers. +who_is_this_for: This is an advanced topic for developers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST server. learning_objectives: - Understand how a RAG system combines document retrieval and language model generation. @@ -16,23 +16,19 @@ learning_objectives: - Build a reproducible RAG application that demonstrates efficient hybrid computing. prerequisites: - - One NVIDIA DGX Spark system with at least 15 GB of available disk space. - - Follow the previous [Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to install both the CPU and GPU builds of llama.cpp. + - An NVIDIA DGX Spark system with at least 15 GB of available disk space. author: Odin Shen ### Tags -skilllevels: Introductory +skilllevels: Advanced subjects: ML armips: - - Cortex-X - Cortex-A operatingsystems: - Linux tools_software_languages: - Python - - C++ - - Bash - llama.cpp further_reading: @@ -40,12 +36,16 @@ further_reading: title: Nvidia DGX Spark link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/ type: website + - resource: + title: EdgeXpert from MSI + link: https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931 + type: website - resource: title: Nvidia DGX Spark Playbooks link: https://github.com/NVIDIA/dgx-spark-playbooks type: documentation - resource: - title: Arm Learning Path + title: Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark link: https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/ type: Learning Path