RAG (Retrieval-Augmented Generation) Tutorial

This tutorial demonstrates how to build a RAG pipeline using NVIDIA NeMo Microservices on OpenShift.

Quick Start: For a concise command reference for infrastructure, instances, and demos, see ../../commands.md.

Overview

This example implements a complete RAG workflow:

  1. Document Ingestion: Upload documents to NeMo Data Store
  2. Embedding Generation: Create embeddings using NeMo Embedding NIM
  3. Vector Storage: Store embeddings in NeMo Entity Store
  4. Query Processing: Retrieve relevant documents based on user queries
  5. Response Generation: Generate answers using LlamaStack client (with fallback to direct Chat NIM) with retrieved context
  6. Optional Guardrails: Apply safety guardrails to responses

RHOAI LlamaStack variant

rag-tutorial-rhoai.ipynb implements the same RAG flow but uses the RHOAI-deployed LlamaStack (e.g. copilot-llama-stack) for chat instead of the nemo-instances LlamaStack. Use it when your chat model is served via RHOAI's copilot-llama-stack (no client API key; model id e.g. vllm-inference/redhataillama-31-8b-instruct). Copy env.donotcommit.example to env.donotcommit and set NMS_NAMESPACE; the notebook sets LLAMASTACK_URL and LLAMASTACK_CHAT_MODEL by default.

Deployed Services

  • ✅ NeMo Data Store (v25.08+)
  • ✅ NeMo Entity Store (v25.08+)
  • ✅ NeMo Guardrails (v25.08+) - Optional but recommended
  • LlamaStack Server - Unified API abstraction layer (deployed via Helm with Bearer token support)
  • KServe InferenceService with meta/llama-3.2-1b-instruct model
    • Service name: Your InferenceService predictor service name
    • Must be accessible via Istio service mesh
    • LlamaStack must have Istio sidecar injected to communicate with KServe services
  • ✅ Embedding NIM: nv-embedqa-1b-v2 service

Note: The service name for the Chat NIM may differ from the model name. Find your service name:

oc get svc -n <your-namespace> | grep llama
oc get inferenceservice -n <your-namespace> | grep llama

Required Configuration

1. Service Account Token (REQUIRED for LlamaStack)

LlamaStack requires a Kubernetes service account token to authenticate with the KServe InferenceService. This token must be set in env.donotcommit:

Get your service account token:

# Replace <your-namespace> and <service-account-name> with your actual values
# The service account name is typically: <inferenceservice-name>-sa
oc create token <service-account-name> -n <your-namespace> --duration=8760h

Example (replace with your actual service account and namespace):

oc create token my-model-sa -n my-namespace --duration=8760h

Add to env.donotcommit:

NIM_SERVICE_ACCOUNT_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6...  # Your token here

2. Model's External URL (REQUIRED for fallback)

The notebook uses the external HTTPS URL as a fallback when LlamaStack is unavailable. Find your InferenceService external URL:

# Get the external URL of your InferenceService
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'

Example:

oc get inferenceservice my-model -n my-namespace -o jsonpath='{.status.url}'
# Example output: https://my-model-my-namespace.apps.my-cluster.example.com

Add to env.donotcommit only if needed: config.py should auto-detect this URL from the InferenceService, but you can set it explicitly to override the detected value.

3. Istio Service Mesh Membership

Your namespace must be a member of the Istio service mesh for LlamaStack to communicate with KServe InferenceService:

# Check if namespace is in the mesh
oc get servicemeshmember -n <your-namespace>

# If not, add it (requires cluster admin or service mesh admin)
# This is typically done during initial setup

4. LlamaStack Configuration

Important: LlamaStack deployment depends on the InferenceService being deployed first. The LlamaStack pod will be in Pending state until the InferenceService creates the required service account.

LlamaStack must be deployed with:

  • ✅ Istio sidecar injection enabled (sidecar.istio.io/inject: "true")
  • ✅ Bearer token authentication enabled (llamastack.useBearerToken: true)
  • ✅ Service account token configured

These are typically configured in the Helm chart values. Verify LlamaStack deployment:

# Check LlamaStack pod status (may be Pending until InferenceService is deployed)
oc get pods -n <your-namespace> | grep llamastack

# Once deployed, verify Istio sidecar is present
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy

Note: If the LlamaStack pod is Pending with an error about a missing service account, this is expected. Deploy your InferenceService first; LlamaStack will deploy automatically once the service account is created.

Python Environment

  • Python 3.8+
  • Jupyter Lab
  • llama-stack-client (installed automatically in notebook)

Quick Start

Run in Workbench/Notebook (Cluster Mode)

The notebook runs in a Workbench/Notebook within the cluster and uses cluster-internal service URLs.

  1. Copy the notebook to the Workbench/Notebook pod:
# Replace <your-namespace> with your actual namespace (find with: oc projects)
NAMESPACE=<your-namespace>

# Get the Workbench/Notebook pod name
JUPYTER_POD=$(oc get pods -n $NAMESPACE -l app=jupyter-notebook -o jsonpath='{.items[0].metadata.name}')

# Copy the RAG demo files to the pod
oc cp demos/rag/rag-tutorial.ipynb $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/config.py $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/requirements.txt $JUPYTER_POD:/work -n $NAMESPACE
oc cp demos/rag/env.donotcommit.example $JUPYTER_POD:/work -n $NAMESPACE
  2. Access Workbench/Notebook in the cluster:
# Port-forward Workbench/Notebook
# Replace <your-namespace> with your actual namespace
oc port-forward -n <your-namespace> svc/jupyter-service 8888:8888
  3. Open Workbench/Notebook in browser: http://localhost:8888 (token: token)

  4. Install dependencies in the notebook:

The notebook uses cluster-internal service URLs automatically. No port-forwards needed for services!

  5. Configure Environment

🔒 SECURITY: This demo uses env.donotcommit file for sensitive configuration. The file is git-ignored and will NOT be committed.

Create env.donotcommit file from the template:

# Copy the template
cp env.donotcommit.example env.donotcommit

# Edit env.donotcommit and fill in your values

Required Configuration in env.donotcommit:

  1. Namespace (REQUIRED):
NMS_NAMESPACE=<your-namespace>

Find your namespace:

oc projects
  2. Service Account Token (REQUIRED for LlamaStack):
NIM_SERVICE_ACCOUNT_TOKEN=<your-service-account-token>

Get your token:

# Replace with your actual service account name (typically: <inferenceservice-name>-sa)
# Example: oc create token my-model-sa -n my-namespace --duration=8760h
oc create token <service-account-name> -n <your-namespace> --duration=8760h
  3. Model External URL (REQUIRED for fallback): The external URL is typically auto-detected from the InferenceService, but you can verify:
oc get inferenceservice <your-inferenceservice-name> -n <your-namespace> -o jsonpath='{.status.url}'

Optional Configuration:

  • NDS_TOKEN=token - NeMo Data Store token (default: "token")
  • DATASET_NAME=rag-tutorial-documents - Dataset name for RAG documents
  • RAG_TOP_K=5 - Number of documents to retrieve
  • RAG_SIMILARITY_THRESHOLD=0.3 - Similarity threshold for retrieval

Find your service names:

# Chat NIM service (KServe InferenceService)
oc get inferenceservice -n <your-namespace>
oc get svc -n <your-namespace> | grep predictor

# Embedding NIM service
oc get svc -n <your-namespace> | grep embedqa

Configuration

The notebook uses config.py which:

  • Sets up cluster-internal service URLs automatically
  • Loads configuration from env.donotcommit file (git-ignored, secure)
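The env.donotcommit loading can be approximated by a small KEY=VALUE parser. This is an illustrative sketch only (the real config.py may parse differently); comment lines and blanks are skipped:

```python
import os

def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring comments and blank lines."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```

A file containing `NMS_NAMESPACE=demo` would yield `{"NMS_NAMESPACE": "demo"}`.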

🔒 Security: All sensitive values (tokens, API keys) are loaded from env.donotcommit file, which is git-ignored and will NOT be committed to version control.

Service URLs

Cluster Mode (Workbench/Notebook within cluster):

  • Data Store: http://nemodatastore-sample.{namespace}.svc.cluster.local:8000
  • Entity Store: http://nemoentitystore-sample.{namespace}.svc.cluster.local:8000
  • Guardrails: http://nemoguardrails-sample.{namespace}.svc.cluster.local:8000
  • Chat NIM: http://meta-llama3-1b-instruct.{namespace}.svc.cluster.local:8000
  • Embedding NIM: http://nv-embedqa-1b-v2.{namespace}.svc.cluster.local:8000
  • LlamaStack: http://llamastack.{namespace}.svc.cluster.local:8321
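All of these URLs follow a single pattern, so a helper like the following (hypothetical, mirroring what config.py derives from NMS_NAMESPACE) can build them:

```python
def service_url(service: str, namespace: str, port: int = 8000) -> str:
    """Build a cluster-internal service URL from its name and namespace."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

# Examples for an assumed namespace "my-namespace"
DATA_STORE_URL = service_url("nemodatastore-sample", "my-namespace")
LLAMASTACK_URL = service_url("llamastack", "my-namespace", port=8321)
```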

RAG Workflow

1. Document Ingestion

  • Upload documents (PDFs, text files, etc.) to NeMo Data Store
  • Register dataset using LlamaStack's client.beta.datasets.register() API for Data Store
  • Register dataset in Entity Store using direct HTTP API (required for some NeMo services)
  • Documents are stored in a namespace for organization
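Long documents are typically split into overlapping chunks before embedding. A minimal character-based chunker (illustrative only; the notebook may chunk differently) looks like:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.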

2. Embedding Generation

  • Use NeMo Embedding NIM (nv-embedqa-1b-v2) to generate embeddings
  • Each document chunk is converted to a vector representation
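A hedged sketch of this step against the Embedding NIM's OpenAI-compatible /v1/embeddings endpoint. The URL, model name, and the NVIDIA-specific `input_type` field ("passage" for documents, "query" for queries) are assumptions to adapt to your deployment:

```python
import requests

# Assumed cluster-internal URL; adjust service name and namespace to yours
EMBEDDING_URL = "http://nv-embedqa-1b-v2.my-namespace.svc.cluster.local:8000"

def build_embedding_request(texts, input_type="passage"):
    """Build the request payload for the OpenAI-compatible embeddings endpoint."""
    return {
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": texts,
        "input_type": input_type,  # NVIDIA extension: "passage" or "query"
    }

def embed_texts(texts, input_type="passage"):
    """POST the texts and return one embedding vector per input text."""
    resp = requests.post(
        f"{EMBEDDING_URL}/v1/embeddings",
        json=build_embedding_request(texts, input_type),
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]
```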

3. Vector Storage

  • Store embeddings and metadata in NeMo Entity Store
  • Entity Store provides vector similarity search capabilities

4. Query Processing

  • User submits a query
  • Query is embedded using the same embedding model
  • Similarity search retrieves top-K most relevant documents
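The retrieval logic can be sketched in a few lines of numpy. Entity Store performs the equivalent search server-side; this is purely illustrative of top-K cosine-similarity retrieval with a threshold:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=5, threshold=0.3):
    """Return (doc_index, similarity) pairs for the top_k most similar
    documents whose cosine similarity meets the threshold."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of each document vector against the query
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-10)
    order = np.argsort(sims)[::-1][:top_k]  # best-first
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]
```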

5. Response Generation

  • Retrieved documents are used as context
  • LlamaStack client generates a response using chat completions API (with fallback to direct NIM)
  • Optional: Guardrails validate response safety
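The steps above can be sketched as follows. The model id, URLs, and exact LlamaStack client call are assumptions; the notebook's real logic, including the fallback, lives in rag-tutorial.ipynb:

```python
import requests

def build_messages(query, context_chunks):
    """Stitch retrieved chunks into a system prompt, then append the query."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system",
         "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]

def generate(query, context_chunks, llamastack_url, nim_url, nim_token,
             model="meta/llama-3.2-1b-instruct"):
    """Try LlamaStack first; fall back to the NIM's OpenAI-compatible endpoint."""
    messages = build_messages(query, context_chunks)
    try:
        from llama_stack_client import LlamaStackClient
        client = LlamaStackClient(base_url=llamastack_url)
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content
    except Exception:
        # Fallback: call the NIM directly, authenticating with the
        # service account token from env.donotcommit.
        resp = requests.post(
            f"{nim_url}/v1/chat/completions",
            headers={"Authorization": f"Bearer {nim_token}"},
            json={"model": model, "messages": messages},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```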

Customization

Adjusting Retrieval Parameters

# In config.py or notebook
RAG_TOP_K = 5  # Number of documents to retrieve
RAG_SIMILARITY_THRESHOLD = 0.3  # Minimum similarity score

Using Different Models

The notebook uses:

  • Chat Model: meta/llama-3.2-1b-instruct (via NIM service)
  • Embedding Model: nv-embedqa-1b-v2 (via NIM service)

Note: The service name may differ from the model name. For example, the model meta/llama-3.2-1b-instruct might be deployed as service meta-llama3-1b-instruct. Find your service name:

oc get svc -n <your-namespace> | grep llama

To use different models, update the service names in config.py or set them in env.donotcommit.

Adding Guardrails

Guardrails can be integrated to:

  • Filter unsafe content
  • Validate response quality
  • Enforce compliance policies

Troubleshooting

Embedding NIM: Connection Refused / Pod Pending (Insufficient nvidia.com/gpu)

If you see Connection refused to nv-embedqa-1b-v2 and the pod is Pending with 0/x nodes available: x Insufficient nvidia.com/gpu:

  • The Embedding NIM (nv-embedqa-1b-v2) requests 1 GPU. No node has an available GPU, so the pod never starts.
  • Options:
    1. Free a GPU: Scale down or delete other GPU workloads so the embedding NIM pod can schedule. Check: oc get pods -n <namespace> -o wide and look for GPU-using pods.
    2. Add GPU nodes to the cluster if you have no (or insufficient) GPU capacity.
    3. Use the notebook’s CPU fallback: The RAG notebook tries sentence-transformers (all-MiniLM-L6-v2) on CPU when NIM is unreachable. Install once: %pip install sentence-transformers, then re-run the “Generate embeddings” cell. You’ll see a message like “Using CPU fallback: sentence-transformers/all-MiniLM-L6-v2”. Retrieval quality may differ slightly from NIM but the tutorial runs end-to-end.
  • To disable the embedding NIM and save GPU for other workloads, set nimCacheEmbedding.enabled: false and nimPipelineEmbedding.enabled: false in the nemo-instances Helm values (they are already false by default).

Documents Not Retrieving

  • Verify documents were uploaded to Data Store
  • Check embeddings were generated and stored in Entity Store
  • Verify similarity threshold is not too high

Service Connection Errors

  • Verify all services are running: oc get pods -n <your-namespace>
  • Check service URLs in config.py match your deployment
  • Verify env.donotcommit file exists and has correct NMS_NAMESPACE value
  • Ensure you're running the notebook in a Workbench/Notebook within the cluster

LlamaStack Connection Errors

If LlamaStack is failing with 500 errors or connection issues:

  1. Verify LlamaStack has Istio sidecar:
oc get pod -n <your-namespace> -l app=nemo-llamastack -o jsonpath='{.items[0].spec.containers[*].name}'
# Should show: llamastack-ctr istio-proxy
  2. Verify service account token is set:
# Check token is in env.donotcommit
grep NIM_SERVICE_ACCOUNT_TOKEN env.donotcommit

# Verify token is valid (should not be empty)
oc create token <service-account-name> -n <your-namespace> --duration=8760h
  3. Verify namespace is in Istio mesh:
oc get servicemeshmember -n <your-namespace>
# Should show your namespace as a member
  4. Check LlamaStack logs:
oc logs -n <your-namespace> -l app=nemo-llamastack --tail=100
  5. Verify KServe InferenceService is accessible:
# Test from within the cluster (from a pod with Istio sidecar)
oc exec -n <your-namespace> <llamastack-pod> -- curl -s http://<predictor-service>.<namespace>.svc.cluster.local:80/v1/models
  6. Fallback works: If LlamaStack fails, the notebook automatically falls back to direct NIM calls using the external HTTPS URL with the service account token.

Version Compatibility

  • NeMo Data Store: v25.08+
  • NeMo Entity Store: v25.08+
  • NeMo Guardrails: v25.08+
  • Chat NIM: meta/llama-3.2-1b-instruct:1.8.3 (service name may vary)
  • Embedding NIM: nvidia/llama-3.2-nv-embedqa-1b-v2 (via NIM service)

LlamaStack Integration

This demo uses LlamaStack for both chat completions and dataset registration, providing a unified API abstraction layer over NeMo microservices. The integration:

  • Uses LlamaStack client for chat completions (with automatic fallback to direct NIM if LlamaStack is unavailable)
  • Uses LlamaStack's client.beta.datasets.register() API for Data Store dataset registration
  • Uses direct HTTP requests for Entity Store dataset registration (required for some NeMo services like Customizer and Evaluator)
  • Uses direct HTTP requests for Data Store namespace operations (namespace creation)
  • Uses direct NIM calls for embeddings (as LlamaStack may not expose embeddings API directly)
  • Maintains backward compatibility - works even if LlamaStack service is not deployed (with reduced functionality)

Dataset Registration with LlamaStack

The demo uses LlamaStack's dataset registration API for Data Store:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=LLAMASTACK_URL)

response = client.beta.datasets.register(
    purpose="post-training/messages",
    dataset_id=DATASET_NAME,
    source={
        "type": "uri",
        "uri": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}"
    },
    metadata={
        "format": "json",
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "provider_id": "nvidia",
    }
)

This registers the dataset in Data Store using the hf://datasets/ URI format, which references Data Store's HuggingFace-compatible API.

Entity Store Registration

Entity Store registration is still done via direct HTTP API, as LlamaStack's client API does not expose Entity Store operations:

import requests

response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": f"RAG tutorial documents for {DATASET_NAME}",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "rag-tutorial",
        "format": "json",
    },
)
response.raise_for_status()

Entity Store registration is required for some NeMo services (like Customizer and Evaluator) that need to reference datasets by their Entity Store ID.

LlamaStack Benefits

  • Type safety: Pydantic models instead of raw JSON
  • Unified API: Single client for inference operations
  • Better error handling: Typed exceptions
  • Simplified code: Less boilerplate than direct REST calls for chat completions

Files

  • rag-tutorial.ipynb - Main tutorial notebook (nemo-instances LlamaStack)
  • rag-tutorial-rhoai.ipynb - Same RAG flow using RHOAI LlamaStack (copilot-llama-stack)
  • config.py - Configuration file (cluster mode, includes LlamaStack URL)
  • requirements.txt - Python dependencies (includes llama-stack-client)
  • ../../commands.md - Quick command reference guide (concise version without detailed explanations)
  • env.donotcommit.example - Template for environment configuration (copy to env.donotcommit)

Documentation