A Milvus collection that you use with the NVIDIA RAG Blueprint server must follow specific schema requirements to be compatible with the search and generate APIs. This document outlines the required fields and their configurations.
:::{note}
If you use either LangChain's Milvus integration or NVIDIA's nv-ingest tool for data ingestion, these schema requirements are handled for you automatically. Both tools create and configure the collection with the correct schema fields. You only need to check these requirements when you create collections manually or use custom ingestion methods.
:::
The following fields are required in your Milvus collection schema:
- Vector Field
  - Name: `vector`
  - Description: Stores the document embeddings
- Text Field
  - Name: `text`
  - Description: Stores the document content
- Source Field
  - Name: `source`
  - Can be configured in two ways:
    - Simple string format: Directly store the filename
    - JSON format: Store a JSON object with a `source_id` field containing the filename, for example `{ "source_id": "document.pdf" }`
- Content Metadata Field (Optional)
  - Name: `content_metadata`
  - Type: `JSON` (`DataType.JSON`)
  - Description: Stores additional metadata about the document content
  - Can be used for filtering during search and retrieval
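As a rough illustration of the field layout above, the following sketch assembles one row as a plain Python dictionary. The helper names `build_row` and `source_filename` are hypothetical and not part of the RAG Blueprint or pymilvus; the sketch only shapes the data and does not call Milvus.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Union


def build_row(
    embedding: Sequence[float],
    text: str,
    filename: str,
    content_metadata: Optional[Mapping[str, Any]] = None,
    source_as_json: bool = True,
) -> Dict[str, Any]:
    """Assemble one row using the field names listed above (hypothetical helper)."""
    row: Dict[str, Any] = {
        "vector": [float(x) for x in embedding],  # document embedding
        "text": text,                             # document content
        # JSON form wraps the filename in a source_id key; string form stores it directly.
        "source": {"source_id": filename} if source_as_json else filename,
    }
    # content_metadata is optional, so it is only added when provided.
    if content_metadata is not None:
        row["content_metadata"] = dict(content_metadata)
    return row


def source_filename(source: Union[str, Mapping[str, Any]]) -> str:
    """Read the filename back from either form of the source field."""
    if isinstance(source, str):
        return source
    return source["source_id"]


row = build_row([0.1, 0.2, 0.3], "Some chunk of text", "document.pdf", {"page": 1})
print(source_filename(row["source"]))  # document.pdf
```

The embedding here has three values only to keep the example short; in a real collection its length has to match the `dim` configured on the `vector` field.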
Here's an example of a complete schema definition that meets all requirements:
```python
from pymilvus import DataType

# Schema shown as a plain dictionary; every required field name is present.
schema = {
    'auto_id': True,
    'description': '',
    'fields': [
        {
            'name': 'pk',
            'description': '',
            'type': DataType.INT64,
            'is_primary': True,
            'auto_id': True
        },
        {
            'name': 'vector',
            'description': '',
            'type': DataType.FLOAT_VECTOR,
            'params': {'dim': 2048}
        },
        {
            'name': 'source',
            'description': '',
            'type': DataType.JSON
        },
        {
            'name': 'content_metadata',
            'description': '',
            'type': DataType.JSON
        },
        {
            'name': 'text',
            'description': '',
            'type': DataType.VARCHAR,
            'params': {'max_length': 65535}
        }
    ],
    'enable_dynamic_field': True
}
```

When using this schema with the RAG server:
- The search API will use the `vector` field for similarity search
- The `text` field will be used to return the actual content
- The `source` field will be used to track document sources
- The `content_metadata` field can be used for filtering using the `filter_expr` parameter in search and generate APIs
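To make the filtering point concrete, here is a minimal sketch of building a filter string for the `content_metadata` field. It assumes that `filter_expr` accepts Milvus boolean expression syntax for JSON fields (for example `content_metadata["key"] == "value"`). The helper `metadata_filter` and the metadata keys `category` and `page` are hypothetical and used only for illustration.

```python
import json
from typing import Mapping, Union

Scalar = Union[str, int, float, bool]


def metadata_filter(conditions: Mapping[str, Scalar], field: str = "content_metadata") -> str:
    """AND together equality checks on keys of a JSON field (hypothetical helper)."""
    clauses = []
    for key, value in conditions.items():
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"unsupported value type for key {key!r}")
        # json.dumps double-quotes strings and renders booleans as true/false.
        key_literal = json.dumps(key, ensure_ascii=False)
        value_literal = json.dumps(value, ensure_ascii=False)
        clauses.append(f"{field}[{key_literal}] == {value_literal}")
    return " and ".join(clauses)


expr = metadata_filter({"category": "manual", "page": 3})
print(expr)
# content_metadata["category"] == "manual" and content_metadata["page"] == 3
```

The resulting string would then be supplied as the `filter_expr` value in a search or generate request. Only equality checks are covered here; other operators are left out to keep the sketch small.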
For more information about using metadata for filtering, refer to the Custom Metadata Documentation.