This tutorial demonstrates how to build a RAG pipeline using NVIDIA NeMo Microservices on OpenShift.
Quick Start: For a concise command reference for infrastructure, instances, and demos, see ../../commands.md.
This example implements a complete RAG workflow:
- Document Ingestion: Upload documents to NeMo Data Store
- Embedding Generation: Create embeddings using NeMo Embedding NIM
- Vector Storage: Store embeddings in NeMo Entity Store
- Query Processing: Retrieve relevant documents based on user queries
- Response Generation: Generate answers using LlamaStack client (with fallback to direct Chat NIM) with retrieved context
- Optional Guardrails: Apply safety guardrails to responses
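The workflow above can be sketched end to end in plain Python. Every component below is an in-memory stand-in (a toy embedder, dot-product search, canned generation), not the notebook's actual NeMo or LlamaStack calls:

```python
# Minimal in-memory RAG sketch; each function is a stand-in for the
# corresponding NeMo microservice used in the notebook.
def embed(text):
    """Toy embedding: bag-of-characters counts (stand-in for the Embedding NIM)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def similarity(u, v):
    """Dot-product similarity (Entity Store would do a real vector search)."""
    return sum(a * b for a, b in zip(u, v))

def rag_answer(query, documents, top_k=2):
    """Ingest -> embed -> retrieve -> generate, end to end."""
    index = [(doc, embed(doc)) for doc in documents]           # ingestion + embedding
    qv = embed(query)                                          # query embedding
    ranked = sorted(index, key=lambda d: similarity(qv, d[1]), reverse=True)
    context = [doc for doc, _ in ranked[:top_k]]               # retrieval
    return f"Answer based on: {context}"                       # generation (stand-in)

docs = ["NeMo Data Store holds documents.", "Guardrails filter unsafe output."]
print(rag_answer("Where are documents stored?", docs, top_k=1))
```

In the real pipeline, each of these steps is replaced by a service call: the Embedding NIM, Entity Store search, and the LlamaStack chat completions API.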
`rag-tutorial-rhoai.ipynb` implements the same RAG flow but uses the RHOAI-deployed LlamaStack (e.g. `copilot-llama-stack`) for chat instead of the nemo-instances LlamaStack. Use it when your chat model is served via RHOAI's `copilot-llama-stack` (no client API key; model id e.g. `vllm-inference/redhataillama-31-8b-instruct`). Copy `env.donotcommit.example` to `env.donotcommit` and set `NMS_NAMESPACE`; the notebook sets `LLAMASTACK_URL` and `LLAMASTACK_CHAT_MODEL` by default.
- ✅ NeMo Data Store (v25.08+)
- ✅ NeMo Entity Store (v25.08+)
- ✅ NeMo Guardrails (v25.08+) - Optional but recommended
- ✅ LlamaStack Server - Unified API abstraction layer (deployed via Helm with Bearer token support)
- ✅ KServe InferenceService with `meta/llama-3.2-1b-instruct` model
  - Service name: your InferenceService predictor service name
  - Must be accessible via the Istio service mesh
  - LlamaStack must have an Istio sidecar injected to communicate with KServe services
- ✅ Embedding NIM: `nv-embedqa-1b-v2` service
Note: The service name for the Chat NIM may differ from the model name. Find your service name:
```bash
oc get svc -n <your-namespace> | grep llama
oc get inferenceservice -n <your-namespace> | grep llama
```

LlamaStack requires a Kubernetes service account token to authenticate with the KServe InferenceService. This token must be set in `env.donotcommit`:
Get your service account token:
```bash
# Replace <your-namespace> and <service-account-name> with your actual values
# The service account name is typically: <inferenceservice-name>-sa
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

Example (replace with your actual service account and namespace):

```bash
oc create token my-model-sa -n my-namespace --duration=8760h
```

Add to `env.donotcommit`:

```bash
NIM_SERVICE_ACCOUNT_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6...  # Your token here
```

The notebook uses the external HTTPS URL as a fallback when LlamaStack is unavailable. Find your InferenceService external URL:
```bash
# Get the external URL of your InferenceService
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'
```

Example:

```bash
oc get inferenceservice my-model -n my-namespace -o jsonpath='{.status.url}'
# Example output: https://my-model-my-namespace.apps.my-cluster.example.com
```

Add to `env.donotcommit` (if not auto-detected). The `config.py` file should auto-detect this URL, but you can override it if needed.
Your namespace must be a member of the Istio service mesh for LlamaStack to communicate with KServe InferenceService:
```bash
# Check if namespace is in the mesh
oc get servicemeshmember -n <your-namespace>

# If not, add it (requires cluster admin or service mesh admin)
# This is typically done during initial setup
```

Important: LlamaStack deployment depends on the InferenceService being deployed first. The LlamaStack pod will remain in `Pending` state until the InferenceService creates the required service account.
LlamaStack must be deployed with:
- ✅ Istio sidecar injection enabled (`sidecar.istio.io/inject: "true"`)
- ✅ Bearer token authentication enabled (`llamastack.useBearerToken: true`)
- ✅ Service account token configured
These are typically configured in the Helm chart values. Verify LlamaStack deployment:
```bash
# Check LlamaStack pod status (may be Pending until InferenceService is deployed)
oc get pods -n <your-namespace> | grep llamastack

# Once deployed, verify the Istio sidecar is present
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy
```

Note: If the LlamaStack pod is in `Pending` state with an error about a missing service account, this is expected. Deploy your InferenceService first; LlamaStack will deploy automatically once the service account is created.
- Python 3.8+
- Jupyter Lab
- `llama-stack-client` (installed automatically in the notebook)
The notebook runs in a Workbench/Notebook within the cluster and uses cluster-internal service URLs.
- Copy the notebook to the Workbench/Notebook pod:

```bash
# Replace <your-namespace> with your actual namespace (find with: oc projects)
NAMESPACE=<your-namespace>

# Get the Workbench/Notebook pod name
JUPYTER_POD=$(oc get pods -n $NAMESPACE -l app=jupyter-notebook -o jsonpath='{.items[0].metadata.name}')

# Copy the RAG demo files to the pod
oc cp demos/rag/rag-tutorial.ipynb $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/config.py $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/requirements.txt $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/env.donotcommit.example $JUPYTER_POD:/work -n $NAMESPACE
```

- Access the Workbench/Notebook in the cluster:
```bash
# Port-forward the Workbench/Notebook
# Replace <your-namespace> with your actual namespace
oc port-forward -n <your-namespace> svc/jupyter-service 8888:8888
```

- Open the Workbench/Notebook in a browser: http://localhost:8888 (token: `token`)
- Install dependencies in the notebook:
The notebook uses cluster-internal service URLs automatically. No port-forwards needed for services!
- Configure Environment
🔒 SECURITY: This demo uses env.donotcommit file for sensitive configuration. The file is git-ignored and will NOT be committed.
Create env.donotcommit file from the template:
```bash
# Copy the template
cp env.donotcommit.example env.donotcommit

# Edit env.donotcommit and fill in your values
```

Required configuration in `env.donotcommit`:
- Namespace (REQUIRED):
```bash
NMS_NAMESPACE=<your-namespace>
```

Find your namespace:

```bash
oc projects
```

- Service Account Token (REQUIRED for LlamaStack):

```bash
NIM_SERVICE_ACCOUNT_TOKEN=<your-service-account-token>
```

Get your token:

```bash
# Replace with your actual service account name (typically: <inferenceservice-name>-sa)
# Example: oc create token my-model-sa -n my-namespace --duration=8760h
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

- Model External URL (REQUIRED for fallback): The external URL is typically auto-detected from the InferenceService, but you can verify it:
```bash
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'
```

Optional configuration:

- `NDS_TOKEN=token` - NeMo Data Store token (default: "token")
- `DATASET_NAME=rag-tutorial-documents` - Dataset name for RAG documents
- `RAG_TOP_K=5` - Number of documents to retrieve
- `RAG_SIMILARITY_THRESHOLD=0.3` - Similarity threshold for retrieval
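Putting the required and optional settings together, a filled-in `env.donotcommit` might look like this (all values below are placeholders; use your own namespace and token):

```shell
# Placeholder values - substitute your own
NMS_NAMESPACE=my-namespace
NIM_SERVICE_ACCOUNT_TOKEN=eyJhbGciOiJSUzI1NiIs...
NDS_TOKEN=token
DATASET_NAME=rag-tutorial-documents
RAG_TOP_K=5
RAG_SIMILARITY_THRESHOLD=0.3
```

The file is git-ignored, so tokens never reach version control.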
Find your service names:
```bash
# Chat NIM service (KServe InferenceService)
oc get inferenceservice -n <your-namespace>
oc get svc -n <your-namespace> | grep predictor

# Embedding NIM service
oc get svc -n <your-namespace> | grep embedqa
```

The notebook uses `config.py`, which:
- Sets up cluster-internal service URLs automatically
- Loads configuration from the `env.donotcommit` file (git-ignored, secure)
🔒 Security: All sensitive values (tokens, API keys) are loaded from env.donotcommit file, which is git-ignored and will NOT be committed to version control.
Cluster Mode (Workbench/Notebook within cluster):
- Data Store: `http://nemodatastore-sample.{namespace}.svc.cluster.local:8000`
- Entity Store: `http://nemoentitystore-sample.{namespace}.svc.cluster.local:8000`
- Guardrails: `http://nemoguardrails-sample.{namespace}.svc.cluster.local:8000`
- Chat NIM: `http://meta-llama3-1b-instruct.{namespace}.svc.cluster.local:8000`
- Embedding NIM: `http://nv-embedqa-1b-v2.{namespace}.svc.cluster.local:8000`
- LlamaStack: `http://llamastack.{namespace}.svc.cluster.local:8321`
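As a rough sketch of how `config.py` could assemble such URLs from `NMS_NAMESPACE` (the helper name and defaults are illustrative; the real `config.py` may differ):

```python
import os

# Illustrative sketch of cluster-internal URL construction; service names
# and ports mirror the list above, but the actual config.py may differ.
def service_url(service: str, namespace: str, port: int = 8000) -> str:
    """Build a cluster-internal Kubernetes service URL."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

namespace = os.environ.get("NMS_NAMESPACE", "my-namespace")

DATA_STORE_URL = service_url("nemodatastore-sample", namespace)
ENTITY_STORE_URL = service_url("nemoentitystore-sample", namespace)
LLAMASTACK_URL = service_url("llamastack", namespace, port=8321)
```

Because every URL is derived from the namespace, setting `NMS_NAMESPACE` in `env.donotcommit` is enough to retarget the whole pipeline.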
- Upload documents (PDFs, text files, etc.) to NeMo Data Store
- Register the dataset using LlamaStack's `client.beta.datasets.register()` API for Data Store
- Register the dataset in Entity Store using the direct HTTP API (required for some NeMo services)
- Documents are stored in a namespace for organization
- Use the NeMo Embedding NIM (`nv-embedqa-1b-v2`) to generate embeddings
- Each document chunk is converted to a vector representation
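A hedged sketch of what an embedding call looks like, assuming the NIM exposes an OpenAI-compatible `/v1/embeddings` endpoint with NVIDIA's `input_type` extension (field names may vary by NIM version; the request body is built here but not sent):

```python
import json

# Assumed cluster-internal URL; substitute your namespace
EMBEDDING_NIM_URL = "http://nv-embedqa-1b-v2.my-namespace.svc.cluster.local:8000"

def build_embedding_request(chunks, input_type="passage"):
    """Build the JSON body for an OpenAI-compatible /v1/embeddings call.

    input_type ("passage" for documents, "query" for queries) is an
    NVIDIA extension used by the nv-embedqa models.
    """
    return {
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": list(chunks),
        "input_type": input_type,
    }

body = build_embedding_request(["NeMo Data Store stores documents."])
# To send: requests.post(f"{EMBEDDING_NIM_URL}/v1/embeddings", json=body, timeout=30)
print(json.dumps(body)[:60])
```

Queries are embedded the same way with `input_type="query"` so that query and passage vectors live in the same space.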
- Store embeddings and metadata in NeMo Entity Store
- Entity Store provides vector similarity search capabilities
- User submits a query
- Query is embedded using the same embedding model
- Similarity search retrieves top-K most relevant documents
- Retrieved documents are used as context
- LlamaStack client generates a response using chat completions API (with fallback to direct NIM)
- Optional: Guardrails validate response safety
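The retrieval step (embed the query, keep the top-K documents above a similarity threshold) can be illustrated with a minimal in-memory sketch. This is plain Python, not the notebook's actual Entity Store search:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, doc_vecs, top_k=5, threshold=0.3):
    """Return indices of the top-k docs whose similarity passes the threshold."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored = [(i, s) for i, s in scored if s >= threshold]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in scored[:top_k]]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([1.0, 0.0], docs, top_k=2, threshold=0.3))  # → [0, 1]
```

Raising `RAG_SIMILARITY_THRESHOLD` trades recall for precision: fewer but more relevant chunks reach the chat model's context.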
```python
# In config.py or the notebook
RAG_TOP_K = 5                    # Number of documents to retrieve
RAG_SIMILARITY_THRESHOLD = 0.3   # Minimum similarity score
```

The notebook uses:
- Chat Model: `meta/llama-3.2-1b-instruct` (via NIM service)
- Embedding Model: `nv-embedqa-1b-v2` (via NIM service)

Note: The service name may differ from the model name. For example, the model `meta/llama-3.2-1b-instruct` might be deployed as the service `meta-llama3-1b-instruct`. Find your service name:

```bash
oc get svc -n <your-namespace> | grep llama
```

To use different models, update the service names in `config.py` or set them in `env.donotcommit`.
Guardrails can be integrated to:
- Filter unsafe content
- Validate response quality
- Enforce compliance policies
If you see `Connection refused` to `nv-embedqa-1b-v2` and the pod is `Pending` with `0/x nodes available: x Insufficient nvidia.com/gpu`:
- The Embedding NIM (nv-embedqa-1b-v2) requests 1 GPU. No node has an available GPU, so the pod never starts.
- Options:
  - Free a GPU: Scale down or delete other GPU workloads so the embedding NIM pod can schedule. Check `oc get pods -n <namespace> -o wide` and look for GPU-using pods.
  - Add GPU nodes to the cluster if you have no (or insufficient) GPU capacity.
  - Use the notebook's CPU fallback: The RAG notebook tries sentence-transformers (`all-MiniLM-L6-v2`) on CPU when the NIM is unreachable. Install once with `%pip install sentence-transformers`, then re-run the "Generate embeddings" cell. You'll see a message like "Using CPU fallback: sentence-transformers/all-MiniLM-L6-v2". Retrieval quality may differ slightly from the NIM, but the tutorial runs end-to-end.
- To disable the embedding NIM and save GPU for other workloads, set `nimCacheEmbedding.enabled: false` and `nimPipelineEmbedding.enabled: false` in the nemo-instances Helm values (they are already false by default).
- Verify documents were uploaded to Data Store
- Check embeddings were generated and stored in Entity Store
- Verify similarity threshold is not too high
- Verify all services are running: `oc get pods -n <your-namespace>`
- Check that service URLs in `config.py` match your deployment
- Verify the `env.donotcommit` file exists and has the correct `NMS_NAMESPACE` value
- Ensure you're running the notebook in a Workbench/Notebook within the cluster
If LlamaStack is failing with 500 errors or connection issues:
- Verify LlamaStack has an Istio sidecar:

```bash
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy
```

- Verify the service account token is set:

```bash
# Check the token is in env.donotcommit
grep NIM_SERVICE_ACCOUNT_TOKEN env.donotcommit

# Verify the token is valid (should not be empty)
oc create token <service-account-name> -n <your-namespace> --duration=8760h
```

- Verify the namespace is in the Istio mesh:

```bash
oc get servicemeshmember -n <your-namespace>
# Should show your namespace as a member
```

- Check LlamaStack logs:

```bash
oc logs -n <your-namespace> -l app=nemo-llamastack --tail=100
```

- Verify the KServe InferenceService is accessible:

```bash
# Test from within the cluster (from a pod with an Istio sidecar)
oc exec -n <your-namespace> <llamastack-pod> -- curl -s http://<predictor-service>.<namespace>.svc.cluster.local:80/v1/models
```

- Fallback works: If LlamaStack fails, the notebook automatically falls back to direct NIM calls using the external HTTPS URL with the service account token.
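The fallback can be sketched as a bearer-token request to the NIM's OpenAI-compatible chat endpoint. The URL, model name, and token below are placeholders, and the request is constructed but not sent:

```python
# Sketch of the direct-NIM fallback request; the notebook's actual fallback
# logic lives in its RAG cells and may differ in detail.
def build_fallback_request(external_url, token, model, prompt):
    """Build URL, headers, and body for an OpenAI-compatible chat completion."""
    return {
        "url": f"{external_url}/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

req = build_fallback_request(
    "https://my-model-my-namespace.apps.my-cluster.example.com",  # placeholder URL
    "eyJhbGciOiJSUzI1NiIs...",                                    # placeholder token
    "meta/llama-3.2-1b-instruct",
    "What is in the Data Store?",
)
# To send: requests.post(req["url"], headers=req["headers"], json=req["json"], timeout=60)
print(req["url"])
```

The same service account token from `env.donotcommit` authenticates this external call, which is why `NIM_SERVICE_ACCOUNT_TOKEN` is required even when LlamaStack is the primary path.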
- NeMo Data Store: v25.08+
- NeMo Entity Store: v25.08+
- NeMo Guardrails: v25.08+
- Chat NIM: `meta/llama-3.2-1b-instruct:1.8.3` (service name may vary)
- Embedding NIM: `nvidia/llama-3.2-nv-embedqa-1b-v2` (via NIM service)
This demo uses LlamaStack for both chat completions and dataset registration, providing a unified API abstraction layer over NeMo microservices. The integration:
- Uses LlamaStack client for chat completions (with automatic fallback to direct NIM if LlamaStack is unavailable)
- Uses LlamaStack's `client.beta.datasets.register()` API for Data Store dataset registration
- Uses direct HTTP requests for Entity Store dataset registration (required for some NeMo services like Customizer and Evaluator)
- Uses direct HTTP requests for Data Store namespace operations (namespace creation)
- Uses direct NIM calls for embeddings (as LlamaStack may not expose embeddings API directly)
- Maintains backward compatibility - works even if LlamaStack service is not deployed (with reduced functionality)
The demo uses LlamaStack's dataset registration API for Data Store:

```python
response = client.beta.datasets.register(
    purpose="post-training/messages",
    dataset_id=DATASET_NAME,
    source={
        "type": "uri",
        "uri": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
    },
    metadata={
        "format": "json",
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "provider_id": "nvidia",
    },
)
```

This registers the dataset in Data Store using the `hf://datasets/` URI format, which references Data Store's HuggingFace-compatible API.
Entity Store registration is still done via direct HTTP API, as LlamaStack's client API does not expose Entity Store operations:
```python
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "rag-tutorial",
        "format": "json",
    },
)
```

Entity Store registration is required for some NeMo services (like Customizer and Evaluator) that need to reference datasets by their Entity Store ID.
- Type safety: Pydantic models instead of raw JSON
- Unified API: Single client for inference operations
- Better error handling: Typed exceptions
- Simplified code: Less boilerplate than direct REST calls for chat completions
- `rag-tutorial.ipynb` - Main tutorial notebook (nemo-instances LlamaStack)
- `rag-tutorial-rhoai.ipynb` - Same RAG flow using the RHOAI LlamaStack (copilot-llama-stack)
- `config.py` - Configuration file (cluster mode, includes the LlamaStack URL)
- `requirements.txt` - Python dependencies (includes llama-stack-client)
- `../../commands.md` - Quick command reference guide (concise version without detailed explanations)
- `env.donotcommit.example` - Template for environment configuration (copy to `env.donotcommit`)