This documentation contains the release notes for NVIDIA RAG Blueprint.
This release adds new features to the RAG pipeline to support agent workflows, and enhances generation with VLMs that augment multimodal input.
This release contains the following key changes:
- Updated NIMs and code to support NVIDIA Ingest 26.01 release.
- Added support for non-NIM models, including OpenAI models, models hosted on AWS and Azure, OSS models, and others, through service-specific API keys. For details, refer to Get an API Key.
- The RAG Blueprint now uses nemoretriever-ocr-v1 as the default OCR model. For details, refer to NeMo Retriever OCR Configuration Guide.
- Improved VLM-based generation support. The Vision-Language Model (VLM) inference feature now uses the nemotron-nano-12b-v2-vl model. For details, refer to VLM for Generation.
- User interface improvements including catalog display, image and text query, and others. For details, refer to User Interface.
- Added ingestion metrics endpoint support with OpenTelemetry (OTEL) for monitoring document uploads, elements ingested, and pages processed. For details, refer to Observability.
- Added support for image and text as input queries. For details, refer to Multimodal Query Support.
- Added Nemotron-3-Nano model support with a reasoning budget. For details, refer to Enable Reasoning.
- Vector Database enhancements including secure database access. For details, refer to Milvus Configuration and Elasticsearch Configuration.
- You can now access RAG functionality from a Model Context Protocol (MCP) server for tool integration. For details, refer to MCP Server and Client Usage.
- Added OpenAI-compatible search endpoint for integration with OpenAI tools. For details, refer to API - RAG Server Schema.
- Added support for collection-level data catalog, descriptions, and metadata. For details, refer to Data Catalog.
- Enhanced the `/status` endpoint to publish ingestion metrics and status information. For details, refer to the ingestion notebook.
- Multi-turn conversation support is no longer enabled by default for either the retrieval or the generation stage of the pipeline. For details, refer to Multi-Turn Conversation Support.
- Improved document processing and element extraction.
- Enhancements to RAG library mode including the following. For details, refer to Use the NVIDIA RAG Blueprint Python Package.
- Independent multi-instance support for the RAG Server and the ingestion server
- Configuration support through function arguments
- Async interface for RAG methods
- Compatibility with the NVIDIA NeMo Agent Toolkit (NAT)
- Summarization enhancements including the following. For details, refer to Document Summarization Customization Guide.
- Shallow summarization support
- Easy model switches and dedicated configurations
- Ease of prompt changes
- Reserved the field names `type`, `subtype`, and `location` for NV-Ingest exclusive use in metadata schemas.
- Added the rag_library_lite_usage.ipynb notebook, which demonstrates containerless deployment of the NVIDIA RAG Python package in lite mode.
- Added example showcasing NeMo Agent Toolkit integration with NVIDIA RAG.
- Added weighted hybrid search support with configurable weights.
- RAG server logging improvements.
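The weighted hybrid search change above can be illustrated with a minimal score-fusion sketch. The min-max normalization, default weights, and function names here are assumptions for illustration only, not the blueprint's actual algorithm; refer to the blueprint documentation for the supported configuration.

```python
# Illustrative (not the blueprint's implementation): blend dense and
# sparse retrieval scores with configurable weights.

def fuse_scores(dense, sparse, dense_weight=0.7, sparse_weight=0.3):
    """Combine per-document dense and sparse scores after min-max normalization."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    docs = set(d) | set(s)
    fused = {doc: dense_weight * d.get(doc, 0.0) + sparse_weight * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Dense (embedding similarity) and sparse (e.g. BM25) scores for the same docs.
ranked = fuse_scores(
    dense={"doc_a": 0.92, "doc_b": 0.55, "doc_c": 0.10},
    sparse={"doc_a": 3.1, "doc_c": 7.8},
)
```

Raising `sparse_weight` relative to `dense_weight` biases results toward keyword matches; the two weights need not sum to 1, but keeping them normalized makes fused scores comparable across configurations.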
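The OpenAI-compatible search endpoint mentioned above can be exercised with a small client sketch. The `/v1/search` path and the payload field names (`collection_names`, `top_k`) are assumptions for illustration, not the documented schema; refer to API - RAG Server Schema for the actual contract.

```python
# Illustrative client for a hypothetical OpenAI-compatible search endpoint.
import json

def build_search_request(query, collection, top_k=4):
    """Assemble a JSON payload; field names are assumed, check the schema docs."""
    return {
        "query": query,
        "collection_names": [collection],
        "top_k": top_k,
    }

def search(base_url, payload):
    """POST the payload to the (assumed) /v1/search route of the RAG server."""
    import urllib.request
    req = urllib.request.Request(
        f"{base_url}/v1/search",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_search_request("What changed in this release?", "release_notes")
# search("http://localhost:8081", payload)  # run against a deployed RAG server
```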
The following are the known issues that are fixed in this version:
- Fixed issue in NIM LLM for automatic profile selection. For details, refer to Model Profiles.
The following are the known limitations in this version:
- DRA support using the NIM operator-based Helm chart is not available in this release.
For the full list of known issues, refer to Known Issues.
This release is a hotfix for RAG v2.3.0, and includes the following changes:
- Bumped the embedqa version to 1.10.1 and the nim-llm version to 1.14.0.
- Aligned Helm values and any referenced tags with the new embedqa and nim-llm versions.
The following are the known issues for the NVIDIA RAG Blueprint:
- DRA support using the NIM operator-based Helm chart is not available.
- The optional features reflection and image captioning are not available in Helm-based deployments.
- Currently, Helm-based deployment is not supported for NeMo Guardrails.
- The Blueprint responses can have significant latency when using NVIDIA API Catalog cloud-hosted models.
- The accuracy of the pipeline is optimized for certain file types like `.pdf`, `.txt`, and `.docx`. Accuracy may be poor for other file types supported by NV-Ingest, because image captioning is disabled by default.
- When you update model configurations in the Kubernetes `values.yaml` file (for example, changing from 70B to 8B models), the RAG UI automatically detects and displays the new model configuration from the backend. No container rebuilds are required; redeploy the Helm chart with the updated values and refresh the UI to see the new model settings in the Settings panel.
- The NeMo LLM microservice can take 5-6 minutes to start for every deployment.
- B200 GPUs are not supported for the following advanced features. For these features, use H100 or A100 GPUs instead.
- Image captioning support for ingested documents
- NeMo Guardrails for guardrails at input/output
- VLM-based inferencing in RAG
- PDF extraction with Nemotron Parse
- Sometimes when HTTP cloud NIM endpoints are used from `deploy/compose/.env`, the `nv-ingest-ms-runtime` container still logs gRPC environment variables. These log entries can be ignored.
- For MIG support, the ingestion profile is currently scaled down when you deploy the chart with MIG slicing. This affects ingestion performance during bulk ingestion; specifically, large bulk ingestion jobs might fail.
- Individual file uploads are limited to a maximum size of 400 MB during ingestion. Files exceeding this limit are rejected and must be split into smaller segments before ingesting.
- The `llama-3.3-nemotron-super-49b-v1.5` model provides more verbose responses in non-reasoning mode compared to v1.0. For some queries, the LLM might respond with information not available in the given context. Also, for out-of-domain queries, the model might provide responses based on its own knowledge. Developers are strongly advised to tune the prompt for their use cases to avoid these scenarios.
- Slow VDB upload is observed in Helm deployments for Elasticsearch.
- Audio model deployment on Kubernetes on RTX‑6000 Pro is not supported in this release.
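To work within the 400 MB per-file upload limit noted above, oversized files can be split before ingestion. This byte-level splitter is a minimal sketch under that assumption; for structured formats such as PDF, split on document or page boundaries instead so each part remains parseable.

```python
# Illustrative pre-ingestion splitter for the 400 MB per-file upload limit.
from pathlib import Path

MAX_UPLOAD_BYTES = 400 * 1024 * 1024  # per-file ingestion limit from the release notes

def split_file(path, chunk_size=MAX_UPLOAD_BYTES):
    """Write path as numbered .partNN files, each at most chunk_size bytes."""
    path = Path(path)
    parts = []
    with path.open("rb") as src:
        index = 0
        while chunk := src.read(chunk_size):
            part = path.with_suffix(path.suffix + f".part{index:02d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts
```

Each part can then be ingested as a separate document; chunk_size is configurable so you can test the logic with small files.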