
Commit 3258a32 (1 parent: 7ca1af1)

BREAKING CHANGE: Adding temporal fields and migration script for milvus

11 files changed (+350 −52 lines)

.hydra_config/config.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -40,6 +40,7 @@ vectordb:
   collection_name: ${oc.env:VDB_COLLECTION_NAME, vdb_test}
   hybrid_search: ${oc.env:VDB_HYBRID_SEARCH, true}
   enable: true
+  schema_version: 1 # Increment when the collection schema changes and a migration is required

 rdb:
   host: ${oc.env:POSTGRES_HOST, rdb}
```

docs/content/docs/documentation/API.mdx

Lines changed: 16 additions & 0 deletions

```diff
@@ -78,6 +78,22 @@ Upload a new file to a specific partition for indexing.
 - `201 Created`: Returns task status URL
 - `409 Conflict`: File already exists in partition

+##### Temporal Filtering
+OpenRAG supports temporal filtering to retrieve documents from specific time periods.
+Clients can include temporal fields in the search endpoints to enable time-aware search.
+
+Four temporal fields are automatically added to the collection schema:
+
+* `datetime`: the file's primary timestamp in your system, in ISO 8601 format
+* `created_at`: when the file was created, in ISO 8601 format
+* `updated_at`: when the file was last modified, in ISO 8601 format
+* `indexed_at`: when the file was indexed in the vector database, in ISO 8601 format
+
+:::info
+`datetime`, `created_at`, and `updated_at` are provided by the client in the file metadata during upload, while `indexed_at` is set automatically by OpenRAG at indexing time.
+:::
+
 ##### Upload files while modeling relations between them

 OpenRAG supports document relationships to enable context-aware retrieval.
```
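The temporal-metadata contract above can be sketched from the client side. A minimal example of building the upload metadata (the author value is illustrative, and `indexed_at` is deliberately absent since OpenRAG sets it at indexing time):

```python
import json
from datetime import datetime

# Client-provided metadata for an upload; the temporal fields are optional.
metadata = {
    "author": "John Doe",
    "datetime": "2024-01-01T12:00:00+00:00",
    "created_at": "2023-12-30T08:15:00+00:00",
    "updated_at": "2023-12-31T09:30:00+00:00",
    # `indexed_at` is set by OpenRAG at indexing time; the client does not send it.
}

# Each temporal field must be a valid ISO 8601 string, or the upload is rejected.
for field in ("datetime", "created_at", "updated_at"):
    datetime.fromisoformat(metadata[field])

payload = json.dumps(metadata)
```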

docs/content/docs/documentation/milvus_migration.md

Lines changed: 64 additions & 20 deletions
````diff
@@ -41,24 +41,6 @@ results = client.query(
 > * `PT3H` = 3 hours
 > * `P2DT6H` = 2 days and 6 hours.

-## Current State
-
-:::info
-Temporal fields are currently stored as **strings**, not **`TIMESTAMPTZ`**. Migrating to `TIMESTAMPTZ` requires a schema and index change, and Milvus doesn't support migrations for schema and index changes, so they have to be handled manually.
-
-Until a Milvus schema & index migration strategy is defined, filtering still works via **lexicographic string comparison** on ISO 8601 strings:
-```python
-expr = "tsz != '2025-01-03T00:00:00+08:00'"  # No ISO/INTERVAL keywords
-results = client.query(
-    collection_name,
-    filter=expr,
-    output_fields=["id", "tsz"],
-    limit=10
-)
-```
-Full `TIMESTAMPTZ` support will be activated in a future release once the migration is established.
-:::
-
 ## Milvus version upgrade Steps
 :::danger[Before running Milvus Version Migration]
 These steps must be performed on a deployment running OpenRAG **prior to version 1.1.6** (Milvus 2.5.4) before switching to a newer version of OpenRAG.
````
````diff
@@ -83,7 +65,7 @@ Then restart the stack:

 ```bash
 docker compose down
-docker compose up -d
+docker compose up --build milvus -d
 ```

 Wait for all services to be healthy before continuing.
````
````diff
@@ -129,4 +111,66 @@ docker inspect milvus-standalone --format '{{ .Config.Image }}'
 # Expected: milvusdb/milvus:v2.6.11
 ```

-Now you can switch to the newest release of OpenRAG and it should work fine.
+Now you can switch to the newest release of OpenRAG and it should work fine.
+
+## Schema Migration — Add Temporal Fields
+
+:::info
+This migration adds `TIMESTAMPTZ` fields (`datetime`, `created_at`, `updated_at`, `indexed_at`) and their `STL_SORT` indexes to an existing collection.
+
+Existing documents will have `null` for these fields; new documents will have them populated at index time.
+:::
+
+:::danger[OpenRAG must be stopped]
+Stop the OpenRAG application before running this migration.
+:::
+
+### Step 1 — Start only the Milvus container
+
+```bash
+docker compose up -d milvus
+```
+
+Wait until Milvus is healthy:
+
+```bash
+docker compose ps milvus
+```
+
+### Step 2 — Dry-run (inspect, no changes)
+
+```bash
+docker compose run --no-deps --rm --build --entrypoint "" openrag \
+  uv run python scripts/migrations/milvus/1.add_temporal_fields.py --dry-run
+```
+
+Review the output to confirm which fields and indexes are missing.
+
+### Step 3 — Apply the migration
+
+```bash
+docker compose run --no-deps --rm --build --entrypoint "" openrag \
+  uv run python scripts/migrations/milvus/1.add_temporal_fields.py
+```
+
+The script will:
+1. Add any missing `TIMESTAMPTZ` fields (nullable)
+2. Create `STL_SORT` indexes for each field
+3. Stamp the collection with `schema_version=1` so OpenRAG no longer reports a migration error on startup
+
````
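The `schema_version=1` stamp from Step 3 is what OpenRAG's startup check reads back. Condensed to a standalone sketch, with the property key and fallback behavior taken from the `vectordb.py` change in this commit:

```python
SCHEMA_VERSION_PROPERTY_KEY = "openrag.schema_version"

def stored_schema_version(properties: dict) -> int:
    # Unstamped or unparseable collections count as version 0.
    raw = properties.get(SCHEMA_VERSION_PROPERTY_KEY)
    try:
        return int(raw) if raw is not None else 0
    except (ValueError, TypeError):
        return 0

# Before the migration there is no stamp, so the check fails against the expected version 1.
assert stored_schema_version({}) == 0
# After Step 3 the stamp satisfies the check and OpenRAG starts normally.
assert stored_schema_version({SCHEMA_VERSION_PROPERTY_KEY: "1"}) == 1
```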
````diff
+### Step 4 — Restart OpenRAG
+
+```bash
+docker compose up --build -d
+```
+
+### Rollback
+
+Milvus does not yet support dropping fields. The rollback only removes the indexes and resets the version stamp — the fields remain in the schema but are unused:
+
+```bash
+docker compose run --no-deps --rm --entrypoint "" openrag \
+  uv run python scripts/migrations/milvus/1.add_temporal_fields.py --downgrade
+```
+
+To fully remove the fields, you would need to recreate the collection from scratch.
````
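Once the migration has run, queries can filter on the new temporal fields. A sketch of building such a filter for recently indexed documents; the comparison-against-ISO-string form mirrors this document's earlier query examples, and the exact `TIMESTAMPTZ` expression syntax should be verified against your Milvus version:

```python
from datetime import datetime, timedelta, timezone

# Filter for documents indexed in the last 2 days.
cutoff = (datetime.now(timezone.utc) - timedelta(days=2)).isoformat()
expr = f"indexed_at >= '{cutoff}'"

# Hypothetical usage with a pymilvus client, as in the query examples above:
# results = client.query(collection_name, filter=expr, output_fields=["id", "indexed_at"], limit=10)
```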

openrag/components/indexer/vectordb/vectordb.py

Lines changed: 56 additions & 5 deletions

```diff
@@ -1,6 +1,7 @@
 import asyncio
 import time
 from abc import ABC, abstractmethod
+from datetime import UTC, datetime

 import numpy as np
 import ray
@@ -102,12 +103,8 @@ async def get_file_chunks(self, file_id: str, partition: str, include_id: bool =
     async def get_chunk_by_id(self, chunk_id: str):
         pass

-    # @abstractmethod
-    # def sample_chunk_ids(
-    #     self, partition: str, n_ids: int = 100, seed: int | None = None
-    # ):
-    #     pass

+SCHEMA_VERSION_PROPERTY_KEY = "openrag.schema_version"

 MAX_LENGTH = 65_535

@@ -140,6 +137,7 @@ def __init__(self):

         self.config = load_config()
         self.logger = get_logger()
+        self.time_fields = ["datetime", "created_at", "updated_at", "indexed_at"]

         # init milvus clients
         self.port = self.config.vectordb.get("port")
@@ -189,6 +187,7 @@ def load_collection(self):
         try:
             if self._client.has_collection(self.collection_name):
                 self.logger.warning(f"Collection `{self.collection_name}` already exists. Loading it.")
+                self._check_schema_version()
             else:
                 self.logger.info("Creating empty collection")
                 index_params = self._create_index()
@@ -212,6 +211,7 @@ def load_collection(self):
                 collection_name=self.collection_name,
                 operation="create_collection",
             )
+            self._store_schema_version()
         try:
             self._client.load_collection(self.collection_name)
             self.collection_loaded = True
@@ -283,6 +283,9 @@ def _create_schema(self):
             dim=self.embedder.embedding_dimension,
         )

+        for time_field in self.time_fields:
+            schema.add_field(field_name=time_field, datatype=DataType.TIMESTAMPTZ, nullable=True)
+
         if self.hybrid_search:
             # Add sparse field for BM25 - this will be auto-generated
             schema.add_field(
@@ -336,9 +339,54 @@ def _create_index(self):
                 "bm25_b": 0.75,
             },
         )
+        # Indexes for the TIMESTAMPTZ date fields
+        for time_field in self.time_fields:
+            index_params.add_index(
+                field_name=time_field,
+                index_type="STL_SORT",  # Index type for TIMESTAMPTZ
+                index_name=f"{time_field}_idx",
+            )

         return index_params

+    def _store_schema_version(self) -> None:
+        """Persist the configured schema_version as a collection property after collection creation."""
+        schema_version = self.config.vectordb.get("schema_version")
+        self._client.alter_collection_properties(
+            collection_name=self.collection_name,
+            properties={SCHEMA_VERSION_PROPERTY_KEY: str(schema_version)},
+        )
+        self.logger.info(f"Schema version {schema_version} stored on collection `{self.collection_name}`.")
+
+    def _check_schema_version(self) -> None:
+        """
+        Read the stored schema version from the collection properties and compare it
+        against the configured schema_version. Raises VDBSchemaMigrationRequiredError
+        if they diverge, so the application fails fast instead of silently working on a
+        stale schema.
+        """
+        expected_version = self.config.vectordb.get("schema_version")
+        desc = self._client.describe_collection(self.collection_name)
+        props = desc.get("properties", {})
+        raw = props.get(SCHEMA_VERSION_PROPERTY_KEY)
+
+        try:
+            stored_version = int(raw) if raw is not None else 0
+        except (ValueError, TypeError):
+            stored_version = 0
+
+        if stored_version != expected_version:
+            raise VDBSchemaMigrationRequiredError(
+                f"Collection `{self.collection_name}` is at schema version {stored_version} "
+                f"but the application requires version {expected_version}. "
+                "Please run the migration script.",
+                collection_name=self.collection_name,
+                stored_version=stored_version,
+                expected_version=expected_version,
+            )
+
+        self.logger.info(f"Collection `{self.collection_name}` schema version {stored_version} — OK.")
+
     async def list_collections(self) -> list[str]:
         return self._client.list_collections()

@@ -379,12 +427,14 @@ async def async_add_documents(self, chunks: list[Document], user: dict) -> None:
         entities = []
         vectors = await self.embedder.aembed_documents(chunks)
         order_metadata_l: list[dict] = _gen_chunk_order_metadata(n=len(chunks))
+        indexed_at = datetime.now(UTC).isoformat()

         for chunk, vector, order_metadata in zip(chunks, vectors, order_metadata_l):
             entities.append(
                 {
                     "text": chunk.page_content,
                     "vector": vector,
+                    "indexed_at": indexed_at,
                     **order_metadata,
                     **chunk.metadata,
                 }
@@ -396,6 +446,7 @@ async def async_add_documents(self, chunks: list[Document], user: dict) -> None:
         )

         # insert file_id and partition into partition_file_manager
+        file_metadata.update({"indexed_at": indexed_at})
         self.partition_file_manager.add_file_to_partition(
             file_id=file_id,
             partition=partition,
```

openrag/routers/extract.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -32,7 +32,7 @@
 - `partition`: Partition name
 - `page`: Page number in source document
 - `datetime`: Document date (if set)
-- `modified_at`: File modification timestamp
+- `updated_at`: File modification timestamp
 - `created_at`: File creation timestamp
 - `indexed_at`: Chunk indexing timestamp
 - Additional custom metadata
```

openrag/routers/indexer.py

Lines changed: 42 additions & 2 deletions

````diff
@@ -49,6 +49,30 @@
 # URL scheme configuration
 PREFERRED_URL_SCHEME = config.server.preferred_url_scheme

+# Temporal fields provided by the client
+TEMPORAL_FIELDS = ["datetime", "created_at", "updated_at"]
+
+
+def get_temporal_fields(metadata: dict) -> dict:
+    """Validate the client-provided temporal fields and return them normalized."""
+    temporal_fields = {}
+
+    for field in TEMPORAL_FIELDS:
+        datetime_str = metadata.get(field, None)
+        if datetime_str:
+            try:
+                # Parse the provided datetime to ensure it is valid ISO 8601
+                d = datetime.fromisoformat(datetime_str)
+                temporal_fields[field] = d.isoformat()
+            except ValueError:
+                raise HTTPException(
+                    status_code=status.HTTP_400_BAD_REQUEST,
+                    detail=f"Invalid ISO 8601 datetime value ({datetime_str}) for field '{field}'.",
+                )
+
+    return temporal_fields
+

 def build_url(request: Request, route_name: str, **path_params) -> str:
     """Build a URL using the preferred scheme if configured."""
@@ -100,9 +124,14 @@ async def get_supported_types():
     "mimetype": "text/plain",
     "author": "John Doe",
     ...
+    "created_at": "2024-01-01T12:00:00Z" // Optional temporal field (ISO 8601)
 }
 ```

+**Temporal Fields:**
+- You can provide temporal fields such as `created_at`, `updated_at`, or `datetime` in the metadata for time-based queries and filtering.
+- Datetime values must be in ISO 8601 format (e.g., `2024-01-01T12:00:00Z`).
+
 **Common Mimetypes:**
 - `text/plain` - Plain text files
 - `text/markdown` - Markdown files
@@ -161,9 +190,12 @@ async def add_file(

     # Append extra metadata
     metadata["file_size"] = human_readable_size(file_stat.st_size)
-    metadata["created_at"] = datetime.fromtimestamp(file_stat.st_ctime).isoformat()
     metadata["file_id"] = file_id

+    # Add client-provided temporal fields to the metadata (validated as ISO 8601)
+    temporal_fields = get_temporal_fields(metadata)
+    metadata.update(temporal_fields)
+
     # Indexing the file
     task = indexer.add_file.remote(path=file_path, metadata=metadata, partition=partition, user=user)
     await task_state_manager.set_state.remote(task.task_id().hex(), "QUEUED")
@@ -224,9 +256,14 @@ async def delete_file(
     "mimetype": "text/plain",
     "author": "John Doe",
     ...
+    "created_at": "2024-01-01T12:00:00Z" // Optional temporal field (ISO 8601)
 }
 ```

+**Temporal Fields:**
+- You can provide temporal fields such as `created_at`, `updated_at`, or `datetime` in the metadata for time-based queries and filtering.
+- Datetime values must be in ISO 8601 format (e.g., `2024-01-01T12:00:00Z`).
+
 **Response:**
 Returns 202 Accepted with a task status URL for tracking indexing progress.
 """,
@@ -277,9 +314,12 @@ async def put_file(

     # Append extra metadata
     metadata["file_size"] = human_readable_size(file_stat.st_size)
-    metadata["created_at"] = datetime.fromtimestamp(file_stat.st_ctime).isoformat()
     metadata["file_id"] = file_id

+    # Add client-provided temporal fields to the metadata (validated as ISO 8601)
+    temporal_fields = get_temporal_fields(metadata)
+    metadata.update(temporal_fields)
+
     # Indexing the file
     task = indexer.add_file.remote(path=file_path, metadata=metadata, partition=partition, user=user)
     await task_state_manager.set_state.remote(task.task_id().hex(), "QUEUED")
````
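The `get_temporal_fields` validation above hinges on `datetime.fromisoformat`. A standalone sketch of which client-supplied values pass (the helper name is hypothetical; it mirrors the server-side check in this commit):

```python
from datetime import datetime

def is_valid_temporal(value: str) -> bool:
    # Mirrors the server-side check: datetime.fromisoformat must accept the value.
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

assert is_valid_temporal("2024-01-01T12:00:00+00:00")  # explicit UTC offset
assert is_valid_temporal("2024-01-01")                 # date-only is valid ISO 8601
assert not is_valid_temporal("01/01/2024")             # rejected with 400 Bad Request
```

Note that `datetime.fromisoformat` only accepts the `Z` suffix shown in the docstring examples (`2024-01-01T12:00:00Z`) on Python 3.11 and later; on older interpreters an explicit offset such as `+00:00` is the safer client-side choice.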
