content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md
6 additions & 6 deletions
@@ -25,11 +25,11 @@ Open two terminals on your GB10 system and use them as listed in the table below
You should also have your original terminals open that you used to run the `llama-server` and the RAG queries in the previous section. You will run these again and use the two new terminals for observation.
-### Prepare for unified memory observation
+## Prepare for unified memory observation
Ensure the RAG pipeline is stopped before starting the observation.
-#### Terminal 1 - system memory observation
+### Terminal 1 - system memory observation
Run the Bash commands below in terminal 1 to print the free memory of the system:
@@ -56,7 +56,7 @@ The printed fields are:
-`free` — Memory not currently allocated or reserved by the system.
-`available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers.
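The diff does not show the terminal 1 commands themselves, so the following is only a minimal sketch of how the `free` and `available` fields could be read, assuming a Linux system that exposes `/proc/meminfo`. The function name `print_mem_fields` and the MiB output format are hypothetical and not taken from the learning path.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the learning path): print the "free" and
# "available" memory figures in MiB, read from a meminfo-style file.
# The kernel reports MemFree and MemAvailable in kB, so divide by 1024.
print_mem_fields() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    /^MemFree:/      { free  = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { printf "free=%d MiB available=%d MiB\n", free / 1024, avail / 1024 }
  ' "$meminfo"
}

# Print one sample if the kernel interface is readable on this machine.
if [ -r /proc/meminfo ]; then
  print_mem_fields
fi
```

To observe changes over time, the call could be wrapped in a loop such as `while true; do print_mem_fields; sleep 1; done` in terminal 1.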
-#### Terminal 2 – GPU status observation
+### Terminal 2 – GPU status observation
Run the Bash commands below in terminal 2 to print the GPU statistics:
@@ -89,7 +89,7 @@ Here is an explanation of the fields:
|`memory.used`| GPU VRAM usage | GB10 does not include separate VRAM; all data resides within Unified Memory |
-### Run the llama-server
+## Run the llama-server
With the idle condition understood, start the `llama.cpp` REST server again in your original terminal, not the two new terminals being used for observation.
@@ -200,7 +200,7 @@ The GPU executes compute kernels with GPU utilization at 96%, without reading fr
The `utilization.memory=0` and `memory.used=[N/A]` metrics are clear signs that data sharing, not data copying, is happening.
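As a rough illustration of that reading, the sketch below classifies one CSV row of GPU statistics. It assumes the row has the form `<utilization.gpu>, <utilization.memory>, <memory.used>`, which is what `nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv,noheader` prints. The exact command used in the learning path is not visible in this diff, and the function name `interpret_gpu_row` and its output wording are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the learning path): interpret one CSV row
# of the form "<utilization.gpu>, <utilization.memory>, <memory.used>".
interpret_gpu_row() {
  local row="$1" gpu_util mem_util mem_used
  # Split the row on commas into its three fields.
  IFS=',' read -r gpu_util mem_util mem_used <<< "$row"
  # Drop spaces and percent signs so the values can be compared directly.
  gpu_util="${gpu_util//[ %]/}"
  mem_util="${mem_util//[ %]/}"
  mem_used="${mem_used// /}"
  if [ "$mem_used" = "[N/A]" ] && [ "$mem_util" = "0" ]; then
    echo "gpu=${gpu_util}% no separate VRAM reported (shared-memory pattern)"
  else
    echo "gpu=${gpu_util}% memory.used=${mem_used} memory-util=${mem_util}%"
  fi
}

# Example row matching the values quoted in the text above.
interpret_gpu_row "96 %, 0 %, [N/A]"
```

On a system with `nvidia-smi` installed, each line of the query output could be piped through this function. The classification is a heuristic for reading the output, not a definitive test of the memory architecture.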
-### Interpret unified memory behavior
+## Interpret unified memory behavior
This experiment confirms the Grace–Blackwell Unified Memory architecture in action:
- The CPU and GPU share the same address space.
@@ -211,7 +211,7 @@ Data does not move — computation moves to the data.
The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, both operating within the same Unified Memory pool.