This tutorial demonstrates how to build a RAG pipeline using NVIDIA NeMo Microservices on OpenShift.
Quick Start: For a concise command reference for infrastructure, instances, and demos, see ../../commands.md.
This example implements a complete RAG workflow:
- Document Ingestion: Upload documents to NeMo Data Store
- Embedding Generation: Create embeddings using NeMo Embedding NIM
- Vector Storage: Store embeddings in NeMo Entity Store
- Query Processing: Retrieve relevant documents based on user queries
- Response Generation: Generate answers using LlamaStack client (with fallback to direct Chat NIM) with retrieved context
- Optional Guardrails: Apply safety guardrails to responses
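The workflow above can be sketched end to end in plain Python. Every component below is an in-memory stand-in (a toy embedder, dot-product search, canned generation), not the notebook's actual NeMo or LlamaStack calls:

```python
# Minimal in-memory RAG sketch; each function is a stand-in for the
# corresponding NeMo microservice used in the notebook.
def embed(text):
    """Toy embedding: bag-of-characters counts (stand-in for the Embedding NIM)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def similarity(u, v):
    """Dot-product similarity (Entity Store would do a real vector search)."""
    return sum(a * b for a, b in zip(u, v))

def rag_answer(query, documents, top_k=2):
    """Ingest -> embed -> retrieve -> generate, end to end."""
    index = [(doc, embed(doc)) for doc in documents]           # ingestion + embedding
    qv = embed(query)                                          # query embedding
    ranked = sorted(index, key=lambda d: similarity(qv, d[1]), reverse=True)
    context = [doc for doc, _ in ranked[:top_k]]               # retrieval
    return f"Answer based on: {context}"                       # generation (stand-in)

docs = ["NeMo Data Store holds documents.", "Guardrails filter unsafe output."]
print(rag_answer("Where are documents stored?", docs, top_k=1))
```

In the real pipeline, each of these steps is replaced by a service call: the Embedding NIM, Entity Store search, and the LlamaStack chat completions API.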
`rag-tutorial-rhoai.ipynb` implements the same RAG flow but uses the RHOAI-deployed LlamaStack (e.g. `copilot-llama-stack`) for chat instead of the nemo-instances LlamaStack. Use it when your chat model is served via RHOAI's `copilot-llama-stack` (no client API key; model id e.g. `vllm-inference/redhataillama-31-8b-instruct`). Copy `env.donotcommit.example` to `env.donotcommit` and set `NMS_NAMESPACE`; the notebook sets `LLAMASTACK_URL` and `LLAMASTACK_CHAT_MODEL` by default.
- ✅ NeMo Data Store (v25.08+)
- ✅ NeMo Entity Store (v25.08+)
- ✅ NeMo Guardrails (v25.08+) - Optional but recommended
- ✅ LlamaStack Server - Unified API abstraction layer (deployed via Helm with Bearer token support)
- ✅ KServe InferenceService with `meta/llama-3.2-1b-instruct` model
  - Service name: your InferenceService predictor service name
  - Must be accessible via the Istio service mesh
  - LlamaStack must have an Istio sidecar injected to communicate with KServe services
- ✅ Embedding NIM: `nv-embedqa-1b-v2` service
Note: The service name for the Chat NIM may differ from the model name. Find your service name:
```bash
oc get svc -n <your-namespace> | grep llama
oc get inferenceservice -n <your-namespace> | grep llama
```

LlamaStack requires a Kubernetes service account token to authenticate with the KServe InferenceService. This token must be set in `env.donotcommit`:
Get your service account token:
```bash
# Replace <your-namespace> and <service-account-name> with your actual values
# The service account name is typically: <inferenceservice-name>-sa
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

Example (replace with your actual service account and namespace):

```bash
oc create token my-model-sa -n my-namespace --duration=8760h
```

Add to `env.donotcommit`:

```bash
NIM_SERVICE_ACCOUNT_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6...  # Your token here
```

The notebook uses the external HTTPS URL as a fallback when LlamaStack is unavailable. Find your InferenceService external URL:
```bash
# Get the external URL of your InferenceService
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'
```

Example:

```bash
oc get inferenceservice my-model -n my-namespace -o jsonpath='{.status.url}'
# Example output: https://my-model-my-namespace.apps.my-cluster.example.com
```

Add to `env.donotcommit` (if not auto-detected). The `config.py` file should auto-detect this URL, but you can override it if needed.
Your namespace must be a member of the Istio service mesh for LlamaStack to communicate with KServe InferenceService:
```bash
# Check if namespace is in the mesh
oc get servicemeshmember -n <your-namespace>

# If not, add it (requires cluster admin or service mesh admin)
# This is typically done during initial setup
```

Important: LlamaStack deployment depends on the InferenceService being deployed first. The LlamaStack pod will remain in `Pending` state until the InferenceService creates the required service account.
LlamaStack must be deployed with:
- ✅ Istio sidecar injection enabled (`sidecar.istio.io/inject: "true"`)
- ✅ Bearer token authentication enabled (`llamastack.useBearerToken: true`)
- ✅ Service account token configured
These are typically configured in the Helm chart values. Verify LlamaStack deployment:
```bash
# Check LlamaStack pod status (may be Pending until InferenceService is deployed)
oc get pods -n <your-namespace> | grep llamastack

# Once deployed, verify the Istio sidecar is present
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy
```

Note: If the LlamaStack pod is in `Pending` state with an error about a missing service account, this is expected. Deploy your InferenceService first; LlamaStack will deploy automatically once the service account is created.
- Python 3.8+
- Jupyter Lab
- `llama-stack-client` (installed automatically in the notebook)
The notebook runs in a Workbench/Notebook within the cluster and uses cluster-internal service URLs.
- Copy the notebook to the Workbench/Notebook pod:

```bash
# Replace <your-namespace> with your actual namespace (find with: oc projects)
NAMESPACE=<your-namespace>

# Get the Workbench/Notebook pod name
JUPYTER_POD=$(oc get pods -n $NAMESPACE -l app=jupyter-notebook -o jsonpath='{.items[0].metadata.name}')

# Copy the RAG demo files to the pod
oc cp demos/rag/rag-tutorial.ipynb $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/config.py $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/requirements.txt $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/env.donotcommit.example $JUPYTER_POD:/work -n $NAMESPACE
```

- Access the Workbench/Notebook in the cluster:
```bash
# Port-forward the Workbench/Notebook
# Replace <your-namespace> with your actual namespace
oc port-forward -n <your-namespace> svc/jupyter-service 8888:8888
```

- Open the Workbench/Notebook in a browser: http://localhost:8888 (token: `token`)
- Install dependencies in the notebook:
The notebook uses cluster-internal service URLs automatically. No port-forwards needed for services!
- Configure Environment
🔒 SECURITY: This demo uses env.donotcommit file for sensitive configuration. The file is git-ignored and will NOT be committed.
Create env.donotcommit file from the template:
```bash
# Copy the template
cp env.donotcommit.example env.donotcommit

# Edit env.donotcommit and fill in your values
```

Required configuration in `env.donotcommit`:
- Namespace (REQUIRED):
```bash
NMS_NAMESPACE=<your-namespace>
```

Find your namespace:

```bash
oc projects
```

- Service Account Token (REQUIRED for LlamaStack):

```bash
NIM_SERVICE_ACCOUNT_TOKEN=<your-service-account-token>
```

Get your token:

```bash
# Replace with your actual service account name (typically: <inferenceservice-name>-sa)
# Example: oc create token my-model-sa -n my-namespace --duration=8760h
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

- Model External URL (REQUIRED for fallback): The external URL is typically auto-detected from the InferenceService, but you can verify it:
```bash
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'
```

Optional configuration:

- `NDS_TOKEN=token` - NeMo Data Store token (default: "token")
- `DATASET_NAME=rag-tutorial-documents` - Dataset name for RAG documents
- `RAG_TOP_K=5` - Number of documents to retrieve
- `RAG_SIMILARITY_THRESHOLD=0.3` - Similarity threshold for retrieval
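Putting the required and optional settings together, a filled-in `env.donotcommit` might look like this (all values below are placeholders; use your own namespace and token):

```shell
# Placeholder values - substitute your own
NMS_NAMESPACE=my-namespace
NIM_SERVICE_ACCOUNT_TOKEN=eyJhbGciOiJSUzI1NiIs...
NDS_TOKEN=token
DATASET_NAME=rag-tutorial-documents
RAG_TOP_K=5
RAG_SIMILARITY_THRESHOLD=0.3
```

The file is git-ignored, so tokens never reach version control.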
Find your service names:
```bash
# Chat NIM service (KServe InferenceService)
oc get inferenceservice -n <your-namespace>
oc get svc -n <your-namespace> | grep predictor

# Embedding NIM service
oc get svc -n <your-namespace> | grep embedqa
```

The notebook uses `config.py`, which:
- Sets up cluster-internal service URLs automatically
- Loads configuration from the `env.donotcommit` file (git-ignored, secure)
🔒 Security: All sensitive values (tokens, API keys) are loaded from env.donotcommit file, which is git-ignored and will NOT be committed to version control.
Cluster Mode (Workbench/Notebook within cluster):
- Data Store: `http://nemodatastore-sample.{namespace}.svc.cluster.local:8000`
- Entity Store: `http://nemoentitystore-sample.{namespace}.svc.cluster.local:8000`
- Guardrails: `http://nemoguardrails-sample.{namespace}.svc.cluster.local:8000`
- Chat NIM: `http://meta-llama3-1b-instruct.{namespace}.svc.cluster.local:8000`
- Embedding NIM: `http://nv-embedqa-1b-v2.{namespace}.svc.cluster.local:8000`
- LlamaStack: `http://llamastack.{namespace}.svc.cluster.local:8321`
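As a rough sketch of how `config.py` could assemble such URLs from `NMS_NAMESPACE` (the helper name and defaults are illustrative; the real `config.py` may differ):

```python
import os

# Illustrative sketch of cluster-internal URL construction; service names
# and ports mirror the list above, but the actual config.py may differ.
def service_url(service: str, namespace: str, port: int = 8000) -> str:
    """Build a cluster-internal Kubernetes service URL."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

namespace = os.environ.get("NMS_NAMESPACE", "my-namespace")

DATA_STORE_URL = service_url("nemodatastore-sample", namespace)
ENTITY_STORE_URL = service_url("nemoentitystore-sample", namespace)
LLAMASTACK_URL = service_url("llamastack", namespace, port=8321)
```

Because every URL is derived from the namespace, setting `NMS_NAMESPACE` in `env.donotcommit` is enough to retarget the whole pipeline.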
- Upload documents (PDFs, text files, etc.) to NeMo Data Store
- Register the dataset using LlamaStack's `client.beta.datasets.register()` API for Data Store
- Register the dataset in Entity Store using the direct HTTP API (required for some NeMo services)
- Documents are stored in a namespace for organization
- Use the NeMo Embedding NIM (`nv-embedqa-1b-v2`) to generate embeddings
- Each document chunk is converted to a vector representation
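A hedged sketch of what an embedding call looks like, assuming the NIM exposes an OpenAI-compatible `/v1/embeddings` endpoint with NVIDIA's `input_type` extension (field names may vary by NIM version; the request body is built here but not sent):

```python
import json

# Assumed cluster-internal URL; substitute your namespace
EMBEDDING_NIM_URL = "http://nv-embedqa-1b-v2.my-namespace.svc.cluster.local:8000"

def build_embedding_request(chunks, input_type="passage"):
    """Build the JSON body for an OpenAI-compatible /v1/embeddings call.

    input_type ("passage" for documents, "query" for queries) is an
    NVIDIA extension used by the nv-embedqa models.
    """
    return {
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": list(chunks),
        "input_type": input_type,
    }

body = build_embedding_request(["NeMo Data Store stores documents."])
# To send: requests.post(f"{EMBEDDING_NIM_URL}/v1/embeddings", json=body, timeout=30)
print(json.dumps(body)[:60])
```

Queries are embedded the same way with `input_type="query"` so that query and passage vectors live in the same space.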
- Store embeddings and metadata in NeMo Entity Store
- Entity Store provides vector similarity search capabilities
- User submits a query
- Query is embedded using the same embedding model
- Similarity search retrieves top-K most relevant documents
- Retrieved documents are used as context
- LlamaStack client generates a response using chat completions API (with fallback to direct NIM)
- Optional: Guardrails validate response safety
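The retrieval step (embed the query, keep the top-K documents above a similarity threshold) can be illustrated with a minimal in-memory sketch. This is plain Python, not the notebook's actual Entity Store search:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, doc_vecs, top_k=5, threshold=0.3):
    """Return indices of the top-k docs whose similarity passes the threshold."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored = [(i, s) for i, s in scored if s >= threshold]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in scored[:top_k]]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([1.0, 0.0], docs, top_k=2, threshold=0.3))  # → [0, 1]
```

Raising `RAG_SIMILARITY_THRESHOLD` trades recall for precision: fewer but more relevant chunks reach the chat model's context.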
```python
# In config.py or the notebook
RAG_TOP_K = 5                    # Number of documents to retrieve
RAG_SIMILARITY_THRESHOLD = 0.3   # Minimum similarity score
```

The notebook uses:
- Chat Model: `meta/llama-3.2-1b-instruct` (via NIM service)
- Embedding Model: `nv-embedqa-1b-v2` (via NIM service)

Note: The service name may differ from the model name. For example, the model `meta/llama-3.2-1b-instruct` might be deployed as the service `meta-llama3-1b-instruct`. Find your service name:

```bash
oc get svc -n <your-namespace> | grep llama
```

To use different models, update the service names in `config.py` or set them in `env.donotcommit`.
Guardrails can be integrated to:
- Filter unsafe content
- Validate response quality
- Enforce compliance policies
If you see `Connection refused` to `nv-embedqa-1b-v2` and the pod is `Pending` with `0/x nodes available: x Insufficient nvidia.com/gpu`:
- The Embedding NIM (nv-embedqa-1b-v2) requests 1 GPU. No node has an available GPU, so the pod never starts.
- Options:
  - Free a GPU: Scale down or delete other GPU workloads so the embedding NIM pod can schedule. Check `oc get pods -n <namespace> -o wide` and look for GPU-using pods.
  - Add GPU nodes to the cluster if you have no (or insufficient) GPU capacity.
  - Use the notebook's CPU fallback: The RAG notebook tries sentence-transformers (`all-MiniLM-L6-v2`) on CPU when the NIM is unreachable. Install once with `%pip install sentence-transformers`, then re-run the "Generate embeddings" cell. You'll see a message like "Using CPU fallback: sentence-transformers/all-MiniLM-L6-v2". Retrieval quality may differ slightly from the NIM, but the tutorial runs end-to-end.
- To disable the embedding NIM and save GPU for other workloads, set `nimCacheEmbedding.enabled: false` and `nimPipelineEmbedding.enabled: false` in the nemo-instances Helm values (they are already false by default).
- Verify documents were uploaded to Data Store
- Check embeddings were generated and stored in Entity Store
- Verify similarity threshold is not too high
- Verify all services are running: `oc get pods -n <your-namespace>`
- Check that service URLs in `config.py` match your deployment
- Verify the `env.donotcommit` file exists and has the correct `NMS_NAMESPACE` value
- Ensure you're running the notebook in a Workbench/Notebook within the cluster
If LlamaStack is failing with 500 errors or connection issues:
- Verify LlamaStack has an Istio sidecar:

```bash
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy
```

- Verify the service account token is set:

```bash
# Check the token is in env.donotcommit
grep NIM_SERVICE_ACCOUNT_TOKEN env.donotcommit

# Verify the token is valid (should not be empty)
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

- Verify the namespace is in the Istio mesh:

```bash
oc get servicemeshmember -n <your-namespace>
# Should show your namespace as a member
```

- Check LlamaStack logs:

```bash
oc logs -n <your-namespace> -l app=nemo-llamastack --tail=100
```

- Verify the KServe InferenceService is accessible:

```bash
# Test from within the cluster (from a pod with an Istio sidecar)
oc exec -n <your-namespace> <llamastack-pod> -- curl -s http://<predictor-service>.<namespace>.svc.cluster.local:80/v1/models
```

- Fallback works: If LlamaStack fails, the notebook automatically falls back to direct NIM calls using the external HTTPS URL with the service account token.
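The fallback can be sketched as a bearer-token request to the NIM's OpenAI-compatible chat endpoint. The URL, model name, and token below are placeholders, and the request is constructed but not sent:

```python
# Sketch of the direct-NIM fallback request; the notebook's actual fallback
# logic lives in its RAG cells and may differ in detail.
def build_fallback_request(external_url, token, model, prompt):
    """Build URL, headers, and body for an OpenAI-compatible chat completion."""
    return {
        "url": f"{external_url}/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

req = build_fallback_request(
    "https://my-model-my-namespace.apps.my-cluster.example.com",  # placeholder URL
    "eyJhbGciOiJSUzI1NiIs...",                                    # placeholder token
    "meta/llama-3.2-1b-instruct",
    "What is in the Data Store?",
)
# To send: requests.post(req["url"], headers=req["headers"], json=req["json"], timeout=60)
print(req["url"])
```

The same service account token from `env.donotcommit` authenticates this external call, which is why `NIM_SERVICE_ACCOUNT_TOKEN` is required even when LlamaStack is the primary path.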
- NeMo Data Store: v25.08+
- NeMo Entity Store: v25.08+
- NeMo Guardrails: v25.08+
- Chat NIM: `meta/llama-3.2-1b-instruct:1.8.3` (service name may vary)
- Embedding NIM: `nvidia/llama-3.2-nv-embedqa-1b-v2` (via NIM service)
This demo uses LlamaStack for both chat completions and dataset registration, providing a unified API abstraction layer over NeMo microservices. The integration:
- Uses LlamaStack client for chat completions (with automatic fallback to direct NIM if LlamaStack is unavailable)
- Uses LlamaStack's `client.beta.datasets.register()` API for Data Store dataset registration
- Uses direct HTTP requests for Entity Store dataset registration (required for some NeMo services like Customizer and Evaluator)
- Uses direct HTTP requests for Data Store namespace operations (namespace creation)
- Uses direct NIM calls for embeddings (as LlamaStack may not expose embeddings API directly)
- Maintains backward compatibility - works even if LlamaStack service is not deployed (with reduced functionality)
The demo uses LlamaStack's dataset registration API for Data Store:

```python
response = client.beta.datasets.register(
    purpose="post-training/messages",
    dataset_id=DATASET_NAME,
    source={
        "type": "uri",
        "uri": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
    },
    metadata={
        "format": "json",
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "provider_id": "nvidia",
    },
)
```

This registers the dataset in Data Store using the `hf://datasets/` URI format, which references Data Store's HuggingFace-compatible API.
Entity Store registration is still done via direct HTTP API, as LlamaStack's client API does not expose Entity Store operations:
```python
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "rag-tutorial",
        "format": "json",
    },
)
```

Entity Store registration is required for some NeMo services (like Customizer and Evaluator) that need to reference datasets by their Entity Store ID.
- Type safety: Pydantic models instead of raw JSON
- Unified API: Single client for inference operations
- Better error handling: Typed exceptions
- Simplified code: Less boilerplate than direct REST calls for chat completions
- `rag-tutorial.ipynb` - Main tutorial notebook (nemo-instances LlamaStack)
- `rag-tutorial-rhoai.ipynb` - Same RAG flow using the RHOAI LlamaStack (copilot-llama-stack)
- `config.py` - Configuration file (cluster mode, includes the LlamaStack URL)
- `requirements.txt` - Python dependencies (includes llama-stack-client)
- `../../commands.md` - Quick command reference guide (concise version without detailed explanations)
- `env.donotcommit.example` - Template for environment configuration (copy to `env.donotcommit`)