Welcome to the NVIDIA RAG Blueprint documentation. Here you can learn how to get started with, customize, and troubleshoot the RAG Blueprint.
- To view this documentation on docs.nvidia.com, browse to NVIDIA RAG Blueprint Documentation.
- To view this documentation on GitHub, browse to NVIDIA RAG Blueprint Documentation.
For the release notes, refer to Release Notes.
For hardware requirements and other information, refer to the Support Matrix.
- Use the procedures in Get Started to quickly deploy the NVIDIA RAG Blueprint.
- Experiment and test in the Web User Interface.
- Use the Python Package to interact with the RAG system directly from Python code.
- Explore the notebooks that demonstrate how to use the APIs. For details, refer to Notebooks.
You can deploy the RAG Blueprint with Docker, Helm, or the NIM Operator, and target dedicated hardware or a Kubernetes cluster. Use the following documentation to deploy the blueprint.
:::{important}
Before you deploy, consider the following:

- Self-hosted deployments require ~200 GB of free disk space for model downloads and caching.
- First-time deployments take 15-30 minutes (Docker) or 60-70 minutes (Kubernetes) as large models are downloaded.
- Model downloads do not show progress bars; see the deployment guides for monitoring commands.
- Subsequent deployments are much faster (2-15 minutes) because models are already cached.

For detailed requirements, refer to the Support Matrix.
:::
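As a quick sanity check before a first deployment, you can verify free disk space where models will be cached and tail the container logs to watch downloads. This is a minimal sketch assuming a Docker Compose deployment on Linux; the cache location and service names are assumptions, so check the deployment guides for the exact monitoring commands for your setup.

```shell
# Check free disk space on the filesystem holding the model cache
# (self-hosted deployments need roughly 200 GB free).
df -h "${HOME}"

# Model downloads show no progress bars; follow the container logs
# instead. Uncomment after deploying (service names vary by setup):
# docker compose logs -f
```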
- Deploy with Docker (Self-Hosted Models)
- Deploy with Docker (NVIDIA-Hosted Models)
- Deploy on Kubernetes with Helm
- Deploy on Kubernetes with Helm from the repository
- Deploy on Kubernetes with Helm and MIG Support
- Deploy Retrieval-Only Mode
Alternative Deployment Options:
- Use the Python Package (Library Mode) - Use the NVIDIA RAG Python package directly for programmatic access to the RAG system
- Containerless Deployment (Lite Mode) - Simplified Python-only setup using Milvus Lite and NVIDIA cloud APIs, without Docker containers
After you deploy the RAG blueprint, you can customize it for your use cases.
- Common configurations
  - Best Practices for Common Settings
  - Change the LLM or Embedding Model
  - Customize LLM Parameters at Runtime
  - Customize Prompts
  - Model Profiles for Hardware Configurations
  - Multi-Collection Retrieval
  - Multi-Turn Conversation Support
  - Reasoning in the Nemotron LLM Model
  - Self-Reflection to Improve Accuracy
  - Summarization
- Data Ingestion and Processing
- Vector Database and Retrieval
- Multimodal and Advanced Generation
- Evaluation
- Governance
- Observability and Telemetry
- NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster
- Finding the Best Chunking Strategy for Accurate AI Responses
```{toctree}
:name: NVIDIA RAG Blueprint
:caption: NVIDIA RAG Blueprint
:maxdepth: 1
:hidden:

Release Notes <release-notes.md>
Support Matrix <support-matrix.md>
```

```{toctree}
:name: Get Started
:caption: Get Started
:maxdepth: 1
:hidden:

Get an API Key <api-key.md>
Get Started with the RAG Blueprint <deploy-docker-self-hosted.md>
Web User Interface <user-interface.md>
Use the RAG Python Package <python-client.md>
Notebooks <notebooks.md>
```

```{toctree}
:name: Deployment Options for RAG Blueprint
:caption: Deployment Options for RAG Blueprint
:maxdepth: 1
:hidden:

Deploy with Docker (NVIDIA-Hosted Models) <deploy-docker-nvidia-hosted.md>
Deploy on Kubernetes with Helm <deploy-helm.md>
Deploy on Kubernetes with Helm from the repository <deploy-helm-from-repo.md>
Deploy on Kubernetes with Helm and MIG Support <mig-deployment.md>
Deploy Retrieval-Only Mode <retrieval-only-deployment.md>
```

```{toctree}
:name: Common configurations
:caption: Common configurations
:maxdepth: 1
:hidden:

Best Practices for Common Settings <accuracy_perf.md>
Change the Model <change-model.md>
Customize Parameters <llm-params.md>
Customize Prompts <prompt-customization.md>
Model Profiles <model-profiles.md>
Multi-Collection Retrieval <multi-collection-retrieval.md>
Multi-Turn Conversation Support <multiturn.md>
Reasoning <enable-nemotron-thinking.md>
Self-reflection <self-reflection.md>
Summarization <summarization.md>
```

```{toctree}
:name: Data Ingestion and Processing
:caption: Data Ingestion and Processing
:maxdepth: 1
:hidden:

Audio Ingestion Support <audio_ingestion.md>
Custom Metadata Support <custom-metadata.md>
Data Catalog for Collections and Documents <data-catalog.md>
File System Access to Results <mount-ingestor-volume.md>
Multimodal Embedding Support (Early Access) <vlm-embed.md>
OCR Configuration Guide <nemoretriever-ocr.md>
Enhanced PDF Extraction <nemotron-parse-extraction.md>
Standalone NV-Ingest <nv-ingest-standalone.md>
Text-Only Ingestion <text_only_ingest.md>
MCP Server Usage <mcp.md>
```

```{toctree}
:name: Vector Database and Retrieval
:caption: Vector Database and Retrieval
:maxdepth: 1
:hidden:

Change the Vector Database <change-vectordb.md>
Hybrid Search <hybrid_search.md>
Milvus Configuration <milvus-configuration.md>
Query Decomposition <query_decomposition.md>
```

```{toctree}
:name: Multimodal and Advanced Generation
:caption: Multimodal and Advanced Generation
:maxdepth: 1
:hidden:

Image Captioning <image_captioning.md>
Multimodal Query Support <multimodal-query.md>
VLM-based Inferencing <vlm.md>
```

```{toctree}
:name: Evaluation
:caption: Evaluation
:maxdepth: 1
:hidden:

Evaluate Your RAG System <evaluate.md>
```

```{toctree}
:name: Governance
:caption: Governance
:maxdepth: 1
:hidden:

NeMo Guardrails <nemo-guardrails.md>
```

```{toctree}
:name: Observability and Telemetry
:caption: Observability and Telemetry
:maxdepth: 1
:hidden:

Observability <observability.md>
Query-to-Answer Pipeline <query-to-answer-pipeline.md>
```

```{toctree}
:name: Troubleshoot RAG Blueprint
:caption: Troubleshoot RAG Blueprint
:maxdepth: 1
:hidden:

Troubleshoot <troubleshooting.md>
RAG Pipeline Debugging Guide <debugging.md>
Migration Guide <migration_guide.md>
```

```{toctree}
:name: Reference
:caption: Reference
:maxdepth: 1
:hidden:

Milvus Collection Schema <milvus-schema.md>
Service Port and GPU Reference <service-port-gpu-reference.md>
API - Ingestor Server Schema <api-ingestor.md>
API - RAG Server Schema <api-rag.md>
```