Updates

madeline-underwood · madeline-underwood · commit 7d602bf59ccd · 2025-11-24T10:53:40.000Z
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_setup.md
@@ -80,7 +80,7 @@ hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
 wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
 ```
 
-### Verify the e5-base-v2 model
+## Verify the e5-base-v2 model
 
 Run a Python script to verify that the e5-base-v2 model loads correctly and can generate embeddings.
 
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2b_rag_setup.md
@@ -133,7 +133,7 @@ This stage enables your RAG pipeline to retrieve the most relevant text chunks w
 
 Use e5-base-v2 to encode the documents and create a FAISS vector index.
 
-### Create and run the FAISS builder script
+## Create and run the FAISS builder script
 
 
 ```bash
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md
@@ -17,7 +17,7 @@ Building upon the previous modules, you will now:
 - Integrate the llama.cpp REST server for GPU-accelerated inference.
 - Execute a complete Retrieval-Augmented Generation (RAG) workflow for end-to-end question answering.
 
-### Start the llama.cpp REST server
+## Start the llama.cpp REST server
 
 Before running the RAG query script, ensure the LLM server is active by running:
 
@@ -185,7 +185,7 @@ This demonstrates that the RAG system correctly retrieved relevant sources and g
 
 You can reference the section 5.1.2 on the PDF to verify the result.
 
-### Observe CPU and GPU utilization
+## Observe CPU and GPU utilization
 
 If you have installed `htop` and `nvtop`, you can observe CPU and GPU utilization.
 
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md
@@ -25,11 +25,11 @@ Open two terminals on your GB10 system and use them as listed in the table below
 You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation.
 
 
-### Prepare for unified memory observation
+## Prepare for unified memory observation
 
 Ensure the RAG pipeline is stopped before starting the observation.
 
-#### Terminal 1 - system memory observation
+### Terminal 1 - system memory observation
 
 Run the Bash commands below in terminal 1 to print the free memory of the system:
 
@@ -56,7 +56,7 @@ The printed fields are:
 - `free` — Memory not currently allocated or reserved by the system.  
 - `available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers.
 
-#### Terminal 2 – GPU status observation
+### Terminal 2 – GPU status observation
 
 Run the Bash commands below in terminal 2 to print the GPU statistics:
 
@@ -89,7 +89,7 @@ Here is an explanation of the fields:
 | `memory.used`        | GPU VRAM usage            | GB10 does not include separate VRAM; all data resides within Unified Memory |
 
 
-### Run the llama-server
+## Run the llama-server
 
 With the idle condition understood, start the `llama.cpp` REST server again in your original terminal, not the two new terminals being used for observation.
 
@@ -200,7 +200,7 @@ The GPU executes compute kernels with GPU utilization at 96%, without reading fr
 
 The `utilization.memory=0` and `memory.used=[N/A]` metrics are clear signs that data sharing, not data copying, is happening.
 
-### Interpret unified memory behavior
+## Interpret unified memory behavior
 
 This experiment confirms the Grace–Blackwell Unified Memory architecture in action:
 - The CPU and GPU share the same address space.
@@ -211,7 +211,7 @@ Data does not move — computation moves to the data.
 
 The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, both operating within the same Unified Memory pool.
 
-### Summary of unified memory behavior
+## Summary of unified memory behavior
 
 | **Observation**                                    | **Unified Memory Explanation**                           |
 |----------------------------------------------------|----------------------------------------------------------|