Commit 0f36f9c

committed
BREAKING CHANGE: Adding a temporal field and migration script for milvus
1 parent d42cc14 commit 0f36f9c

File tree

16 files changed

+750
-68
lines changed

.hydra_config/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ vectordb:
4040
collection_name: ${oc.env:VDB_COLLECTION_NAME, vdb_test}
4141
hybrid_search: ${oc.env:VDB_HYBRID_SEARCH, true}
4242
enable: true
43+
schema_version: 1 # Increment when the collection schema changes and a migration is required
4344

4445
rdb:
4546
host: ${oc.env:POSTGRES_HOST, rdb}

docs/content/docs/documentation/API.mdx

Lines changed: 12 additions & 0 deletions
@@ -78,6 +78,18 @@ Upload a new file to a specific partition for indexing.
7878
- `201 Created`: Returns task status URL
7979
- `409 Conflict`: File already exists in partition
8080

81+
##### Temporal Filtering
82+
OpenRAG supports temporal filtering to retrieve documents from specific time periods.
83+
Clients can attach a temporal field to a file at upload time so that the search endpoints can filter results by time.
84+
85+
* `created_at`: the date and time the file was created, in ISO 8601 format (e.g. `2024-06-15T12:30:00+02:00`). A value without a UTC offset is treated as UTC.
86+
87+
:::info
88+
`created_at` is provided by the client in the metadata of the file during upload.
89+
This is a first iteration — additional temporal fields (e.g. `updated_at`) may be added in future releases as needed.
90+
:::
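
As a minimal sketch of how a client might prepare this metadata, the value can be produced from a timezone-aware `datetime`. The helper name `build_upload_metadata` is hypothetical, and the assumption that the metadata is sent as a JSON object is not confirmed by this page; only the `created_at` key comes from the text above.

```python
import json
from datetime import datetime, timezone


def build_upload_metadata(created_at: datetime) -> str:
    """Serialize file metadata with an ISO 8601 `created_at` value.

    Hypothetical helper: only the `created_at` key is taken from the docs.
    """
    if created_at.tzinfo is None:
        # Make the offset explicit rather than relying on a server-side default.
        created_at = created_at.replace(tzinfo=timezone.utc)
    return json.dumps({"created_at": created_at.isoformat()})


metadata = build_upload_metadata(datetime(2024, 6, 15, 12, 30, tzinfo=timezone.utc))
print(metadata)
```

Sending an explicit offset avoids ambiguity about which time zone the client meant.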
91+
92+
8193
##### Upload files while modeling relations between them
8294

8395
OpenRAG supports document relationships to enable context-aware retrieval.

docs/content/docs/documentation/milvus_migration.md

Lines changed: 63 additions & 19 deletions
@@ -41,24 +41,6 @@ results = client.query(
4141
> * `PT3H` = 3 hours
4242
> * `P2DT6H` = 2 days and 6 hours.
4343
44-
## Current State
45-
46-
:::info
47-
Temporal fields are currently stored as **strings**, not **`TIMESTAMPTZ`**. Migrating to `TIMESTAMPTZ` requires a schema and index change, and Milvus doesn't support migrations on schema and index changes: it has to be handled manually.
48-
49-
Until a Milvus schema & index migration strategy is defined, filtering still works via **lexicographic string comparison** on ISO 8601 strings:
50-
```python
51-
expr = "tsz != '2025-01-03T00:00:00+08:00'" # No ISO/INTERVAL keywords
52-
results = client.query(
53-
collection_name,
54-
filter=expr,
55-
output_fields=["id", "tsz"],
56-
limit=10
57-
)
58-
```
59-
Full `TIMESTAMPTZ` support will be activated in a future release once the migration is established.
60-
:::
61-
6244
## Milvus version upgrade Steps
6345
:::danger[Before running Milvus Version Migration]
6446
These steps must be performed on a deployment running OpenRAG **prior to version 1.1.6** (Milvus 2.5.4) before switching to the newest version of OpenRAG.
@@ -129,4 +111,66 @@ docker inspect milvus-standalone --format '{{ .Config.Image }}'
129111
# Expected: milvusdb/milvus:v2.6.11
130112
```
131113

132-
Now you can switch to the newest release of OpenRAG and it should work fine.
114+
Now you can switch to the newest release of OpenRAG and it should work fine.
115+
116+
## Schema Migration — Add Temporal Fields
117+
118+
:::info
119+
This migration adds a `TIMESTAMPTZ` field, `created_at`, and a `STL_SORT` index on it to an existing collection.
120+
121+
Existing documents will have `null` for that field; new documents will have it populated at index time.
122+
:::
123+
124+
:::danger[OpenRAG must be stopped]
125+
Stop the OpenRAG application before running this migration.
126+
:::
127+
128+
### Step 1 — Start only the Milvus container
129+
130+
```bash
131+
docker compose up -d milvus
132+
```
133+
134+
Wait until Milvus is healthy:
135+
136+
```bash
137+
docker compose ps milvus
138+
```
139+
140+
### Step 2 — Dry-run (inspect, no changes)
141+
142+
```bash
143+
docker compose run --no-deps --rm --build --entrypoint "" openrag \
144+
uv run python scripts/migrations/milvus/1.add_temporal_fields.py --dry-run
145+
```
146+
147+
Review the output to confirm which fields and indexes are missing.
148+
149+
### Step 3 — Apply the migration
150+
151+
```bash
152+
docker compose run --no-deps --rm --build --entrypoint "" openrag \
153+
uv run python scripts/migrations/milvus/1.add_temporal_fields.py
154+
```
155+
156+
The script will:
157+
1. Add any missing `TIMESTAMPTZ` fields (nullable)
158+
2. Create `STL_SORT` indexes for each field
159+
3. Stamp the collection with `schema_version=1` so OpenRAG no longer reports a migration error on startup
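
The three steps above can be pictured as an idempotent plan: each action is emitted only if the collection does not already satisfy it. The sketch below is illustrative only and is not the migration script itself. `CollectionState`, `plan_migration`, and the two constants are hypothetical names, and the real script inspects Milvus rather than an in-memory object.

```python
from dataclasses import dataclass, field

# Assumed values, taken from the text above (schema_version=1, created_at).
TARGET_SCHEMA_VERSION = 1
TEMPORAL_FIELDS = ("created_at",)


@dataclass
class CollectionState:
    """Hypothetical snapshot of what a dry-run might inspect."""

    fields: set[str] = field(default_factory=set)
    indexed_fields: set[str] = field(default_factory=set)
    schema_version: int = 0


def plan_migration(state: CollectionState) -> list[str]:
    """Return the actions still needed; an empty list means nothing to do."""
    actions = []
    for name in TEMPORAL_FIELDS:
        if name not in state.fields:
            actions.append(f"add nullable TIMESTAMPTZ field '{name}'")
        if name not in state.indexed_fields:
            actions.append(f"create STL_SORT index on '{name}'")
    if state.schema_version < TARGET_SCHEMA_VERSION:
        actions.append(f"set schema_version={TARGET_SCHEMA_VERSION}")
    return actions


print(plan_migration(CollectionState()))
```

Under this model, running the plan a second time on an already migrated collection yields no actions, which is the property that makes a dry-run followed by an apply safe to repeat.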
160+
161+
### Step 4 — Restart OpenRAG
162+
163+
```bash
164+
docker compose up --build -d
165+
```
166+
167+
### Rollback
168+
169+
Milvus does not yet support dropping fields. The rollback only removes the indexes and resets the version stamp — the fields remain in the schema but are unused:
170+
171+
```bash
172+
docker compose run --no-deps --rm --build --entrypoint "" openrag \
173+
uv run python scripts/migrations/milvus/1.add_temporal_fields.py --downgrade
174+
```
175+
176+
To fully remove the fields you would need to recreate the collection from scratch.
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
---
2+
title: Temporality
3+
---
4+
5+
# Milvus representation
6+
7+
* As scalar field
8+
9+
Scalar fields store primitive, structured values—commonly referred to as metadata—such as numbers, strings, or dates.
10+
11+
They allow you to narrow search results based on specific attributes, like limiting documents to a particular category or a defined **time range**.
12+
13+
* You can set nullable=True for TIMESTAMPTZ fields to allow missing values.
14+
* You can specify a default timestamp value using the default_value attribute in ISO 8601 format.
15+
16+
* format: timestamp (ISO 8601 format)
17+
* All temporal fields are stored in ISO 8601 format
18+
19+
* **Automatic date extraction**
20+
21+
# Operation
22+
## Add a TIMESTAMPTZ field that allows null values
23+
* `schema.add_field("tsz", DataType.TIMESTAMPTZ, nullable=True)`
24+
* You can specify a default timestamp value using the **`default_value`** attribute in **`ISO 8601` format**.
25+
26+
27+
## Filtering operations
28+
29+
Requires Milvus 2.6.6 or later.
30+
31+
* **`TIMESTAMPTZ`** supports scalar comparisons, interval arithmetic, and extraction of time components.
32+
33+
* **Comparison and filtering**: All filtering and ordering operations are performed in UTC, ensuring consistent and predictable results across different time zones.
34+
35+
* Query with timestamp filtering
36+
* Use comparison operators such as `==`, `!=`, `<`, `>`, `<=`, `>=`. For the arithmetic operators available in Milvus, refer to [Arithmetic Operators](https://milvus.io/docs/basic-operators.md#Arithmetic-Operators)
37+
38+
* timestamp filtering
39+
40+
```python
41+
expr = "tsz != ISO '2025-01-03T00:00:00+08:00'"
42+
43+
results = client.query(
44+
collection_name=collection_name,
45+
filter=expr,
46+
output_fields=["id", "tsz"],
47+
limit=10
48+
)
49+
50+
print("Query result: ", results)
51+
```
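
A common variant of the query above is a time-range filter. The following is a minimal sketch that only builds the filter string; it assumes the `ISO '...'` literal syntax shown above also works with `>=` and `<` and that conditions can be joined with `AND`. The helper name `time_range_filter` is hypothetical.

```python
from datetime import datetime, timezone


def time_range_filter(field: str, start: datetime, end: datetime) -> str:
    """Build a half-open [start, end) filter on a TIMESTAMPTZ field."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be earlier than end")
    # Normalize both bounds to UTC so the generated literals are easy to read.
    lower = start.astimezone(timezone.utc).isoformat()
    upper = end.astimezone(timezone.utc).isoformat()
    return f"{field} >= ISO '{lower}' AND {field} < ISO '{upper}'"


expr = time_range_filter(
    "tsz",
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 2, 1, tzinfo=timezone.utc),
)
print(expr)
```

The resulting string can be passed as the `filter` argument in the same way as `expr` in the query above. A half-open range avoids counting a boundary timestamp twice when adjacent ranges are queried.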
52+
53+
* Interval operations
54+
* You can perform arithmetic on TIMESTAMPTZ fields using INTERVAL values in the ISO 8601 duration format. This allows you to add or subtract durations, such as days, hours, or minutes, from a timestamp when filtering data.
55+
56+
```python
57+
expr = "tsz + INTERVAL 'P0D' != ISO '2025-01-03T00:00:00+08:00'"
58+
59+
results = client.query(
60+
collection_name,
61+
filter=expr,
62+
output_fields=["id", "tsz"],
63+
limit=10
64+
)
65+
66+
print("Query result: ", results)
67+
```
68+
69+
* **`INTERVAL`** values follow the **`ISO 8601` duration** syntax. For example:
70+
* `P1D` = 1 day
* `PT3H` = 3 hours
* `P2DT6H` = 2 days and 6 hours
73+
74+
* You can use **`INTERVAL`** arithmetic directly in filter expressions, such as:
75+
* tsz + INTERVAL 'P3D' → Adds 3 days
76+
* tsz - INTERVAL 'PT2H' → Subtracts 2 hours
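
To make the duration syntax concrete, here is a small client-side sketch that converts the day/hour/minute/second subset of ISO 8601 durations into a `timedelta`. It illustrates what the interval strings mean and is not how Milvus evaluates them; `parse_duration` is a hypothetical helper, and months, years, and weeks are deliberately unsupported.

```python
import re
from datetime import datetime, timedelta, timezone

# Only the subset used above is handled: P1D, PT3H, P2DT6H, and similar.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(text)
    # Reject strings with no components ("P", "PT") or a dangling "T".
    if not match or not any(match.groupdict().values()) or text.endswith("T"):
        raise ValueError(f"Unsupported ISO 8601 duration: {text!r}")
    parts = {key: int(value) for key, value in match.groupdict().items() if value}
    return timedelta(**parts)


# Client-side equivalent of what `tsz + INTERVAL 'P3D'` expresses.
shifted = datetime(2025, 1, 3, tzinfo=timezone.utc) + parse_duration("P3D")
print(shifted.isoformat())
```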
77+
78+
* Search with timestamp filtering
79+
* You can combine **`TIMESTAMPTZ`** filtering with vector similarity search to narrow results by both time and similarity.
80+
81+
82+
83+
--------
84+
85+
* Migration from Milvus v2.5.4 to v2.6.11
86+
* TIMESTAMPTZ is compatible with Milvus 2.6.6+
87+
88+
* Migration according to the release notes for Milvus Standalone: https://milvus.io/docs/upgrade_milvus_standalone-docker.md
89+
* `You must upgrade to v2.5.16 or later before upgrading to v2.6.11.`
90+
91+
* Steps for upgrading: https://milvus.io/docs/upgrade_milvus_standalone-docker.md#Upgrade-process
92+
93+
* Issue: after moving from Milvus 2.5.4 to 2.6.11 following https://milvus.io/docs/upgrade_milvus_standalone-docker.md, collections created in 2.5.4 could not be loaded; the load operation ran indefinitely.
94+
95+
* https://github.com/milvus-io/milvus/issues/43295
96+
97+
* https://www.perplexity.ai/search/i-ve-moved-from-milvs-2-5-4-to-CDHCle5hQl.qsUa_nw4WHQ
98+
99+
100+
101+
102+
* Done successfully
103+
104+
-----
105+
106+
* Set `datatype=DataType.TIMESTAMPTZ` for the `created_at` field
107+
108+
* Search
109+
* search_params for search https://milvus.io/api-reference/pymilvus/v2.6.x/MilvusClient/Vector/search.md#Request-syntax
110+
* param via AnnSearchRequest: https://milvus.io/api-reference/pymilvus/v2.6.x/MilvusClient/Vector/hybrid_search.md#Request-Syntax
111+
112+
113+
-----
114+
115+
* Finally, I managed to make it work by following the migration steps
116+
117+
* Logical operators
118+
* Logical operators are used to combine multiple conditions into a more complex filter expression. These include AND, OR, and NOT.
119+
120+
* Range operators
121+
* https://milvus.io/docs/basic-operators.md#Range-operators
122+
* Supported Range Operators:
123+
* IN: Used to match values within a specific set or range.
124+
* LIKE: Used to match a pattern (mostly for text fields). Milvus allows you to build an NGRAM index on VARCHAR or JSON fields to accelerate text queries. For details, refer to [NGRAM](https://milvus.io/docs/ngram.md).
125+
126+
127+
## Time
128+
129+
Time fields
130+
131+
* datetime
132+
* modified_at
133+
* created_at
134+
==> Added
135+
* indexed_at
136+
137+
138+
# Reorder

openrag/components/indexer/utils/files.py

Lines changed: 26 additions & 1 deletion
@@ -1,12 +1,13 @@
11
import re
22
import secrets
33
import time
4+
from datetime import UTC, datetime
45
from pathlib import Path
56

67
import aiofiles
78
import consts
89
from components.utils import load_config
9-
from fastapi import UploadFile
10+
from fastapi import HTTPException, UploadFile, status
1011

1112
config = load_config()
1213
SERIALIZE_TIMEOUT = config.ray.indexer.get("serialize_timeout", 3600)
@@ -84,3 +85,27 @@ async def serialize_file(task_id: str, path: str, metadata: dict | None = None):
8485
timeout=SERIALIZE_TIMEOUT,
8586
task_description=f"Serialization task {task_id}",
8687
)
88+
89+
90+
def extract_temporal_fields(metadata: dict, temporal_fields: list) -> dict:
91+
result = {}
92+
93+
    # Normalize each requested temporal field present in the metadata; missing or null fields are skipped
94+
for field in temporal_fields:
95+
if field not in metadata or metadata[field] is None:
96+
continue
97+
98+
datetime_str = metadata[field]
99+
try:
100+
# Try parsing the provided datetime to ensure it's valid
101+
d = datetime.fromisoformat(datetime_str)
102+
if d.tzinfo is None:
103+
d = d.replace(tzinfo=UTC)
104+
result[field] = d.isoformat()
105+
except Exception:
106+
raise HTTPException(
107+
status_code=status.HTTP_400_BAD_REQUEST,
108+
detail=f"Invalid ISO 8601 datetime field ({datetime_str}) for field '{field}'.",
109+
)
110+
111+
return result

openrag/components/indexer/utils/test_files.py

Lines changed: 29 additions & 2 deletions
@@ -2,9 +2,9 @@
22
from pathlib import Path
33

44
import pytest
5-
from fastapi import UploadFile
5+
from fastapi import HTTPException, UploadFile
66

7-
from .files import sanitize_filename, save_file_to_disk
7+
from .files import extract_temporal_fields, sanitize_filename, save_file_to_disk
88

99

1010
@pytest.mark.asyncio
@@ -83,3 +83,30 @@ def fake_make_unique_filename(filename: str) -> str:
8383
)
8484
def test_sanitize_filename(input_name, expected):
8585
assert sanitize_filename(input_name) == expected
86+
87+
88+
# --- extract_temporal_fields ---
89+
90+
91+
def test_extract_temporal_fields_field_not_in_metadata():
92+
assert extract_temporal_fields({}, ["created_at"]) == {}
93+
94+
95+
def test_extract_temporal_fields_naive_datetime_defaults_to_utc():
96+
metadata = {"created_at": "2024-06-15T12:30:00"}
97+
result = extract_temporal_fields(metadata, ["created_at"])
98+
assert result == {"created_at": "2024-06-15T12:30:00+00:00"}
99+
100+
101+
def test_extract_temporal_fields_with_timezone():
102+
metadata = {"created_at": "2024-06-15T12:30:00+02:00"}
103+
result = extract_temporal_fields(metadata, ["created_at"])
104+
assert result == {"created_at": "2024-06-15T12:30:00+02:00"}
105+
106+
107+
def test_extract_temporal_fields_invalid_datetime_raises_400():
108+
with pytest.raises(HTTPException) as exc_info:
109+
extract_temporal_fields({"created_at": "not-a-date"}, ["created_at"])
110+
assert exc_info.value.status_code == 400
111+
assert "not-a-date" in exc_info.value.detail
112+
assert "created_at" in exc_info.value.detail
