
Commit 70a3e9b

very much wip
1 parent 3ca4c97 commit 70a3e9b

File tree

17 files changed: +3658 −258 lines


docs/user_guide/data_validation.ipynb

Lines changed: 1102 additions & 0 deletions (large diff not rendered)

docs/validation.md

Lines changed: 228 additions & 0 deletions
# RedisVL Validation System

The RedisVL validation system ensures that data written to Redis indexes conforms to the defined schema. It uses dynamic Pydantic model generation to validate objects before they are stored.
## Key Features

- **Schema-Based Validation**: Validates objects against your index schema definition
- **Dynamic Model Generation**: Creates Pydantic models on the fly based on your schema
- **Type Checking**: Ensures fields contain appropriate data types
- **Field-Specific Validation**:
  - Text and Tag fields must be strings
  - Numeric fields must be integers or floats
  - Geo fields must be properly formatted latitude/longitude strings
  - Vector fields must have the correct dimensions and data types
- **JSON Path Support**: Validates fields extracted from nested JSON structures
- **Fail-Fast Approach**: Stops processing at the first validation error
- **Performance Optimized**: Caches models for repeated validation
## Usage

### Basic Validation

```python
from redisvl.schema.validation import validate_object

# Assuming you have a schema defined
validated_data = validate_object(schema, data)
```
### Storage Integration

Validation is automatically integrated with the storage classes:

```python
from redisvl.index.storage import BaseStorage

# Create storage with schema
storage = BaseStorage(schema=schema, client=redis_client)

# Write data - validation happens automatically
storage.write_one(data)

# Or validate explicitly
validated = storage.validate_object(data)
```
## Field Type Validation

The validation system supports all Redis field types:
### Text Fields

Text fields are validated to ensure they contain string values:

```python
# Valid
{"title": "Hello World"}

# Invalid
{"title": 123}  # Not a string
```
### Tag Fields

Tag fields must likewise contain string values:

```python
# Valid
{"category": "electronics"}

# Invalid
{"category": 123}  # Not a string
```
### Numeric Fields

Numeric fields must contain integers or floats:

```python
# Valid
{"price": 19.99}
{"quantity": 5}

# Invalid
{"price": "19.99"}  # String, not a number
```
### Geo Fields

Geo fields must contain properly formatted latitude/longitude strings:

```python
# Valid
{"location": "37.7749,-122.4194"}  # San Francisco
{"location": "40.7128,-74.0060"}   # New York

# Invalid
{"location": "invalid"}    # Not in lat,lon format
{"location": "91.0,0.0"}   # Latitude out of range (-90 to 90)
{"location": "0.0,181.0"}  # Longitude out of range (-180 to 180)
```
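The format and range rules above can be sketched as a standalone check. This is purely illustrative — `check_geo` is a hypothetical helper, not part of RedisVL's API — assuming the `lat,lon` string format described here:

```python
def check_geo(value: str) -> bool:
    """Return True if value looks like a valid 'lat,lon' string."""
    parts = value.split(",")
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    # Latitude must be in [-90, 90], longitude in [-180, 180]
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```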
### Vector Fields

Vector fields must contain arrays with the correct dimensions and data types:

```python
# Valid
{"embedding": [0.1, 0.2, 0.3, 0.4]}  # 4-dimensional float vector
{"embedding": b'\x00\x01\x02\x03'}   # Raw bytes (dimensions not checked)

# Invalid
{"embedding": [0.1, 0.2, 0.3]}     # Wrong dimensions (schema expects 4)
{"embedding": "not a vector"}      # Wrong type
{"embedding": [0.1, "text", 0.3]}  # Mixed types
```
For integer vectors, the values must be within the appropriate range:

- **INT8**: -128 to 127
- **INT16**: -32,768 to 32,767

```python
# Valid INT8 vector
{"int_vector": [1, 2, 3]}

# Invalid INT8 vector
{"int_vector": [1000, 2000, 3000]}  # Values out of range
```
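As a rough illustration of the dimension and integer-range checks (the `check_int_vector` helper and `INT_RANGES` table below are hypothetical stand-ins, not RedisVL internals — the real checks live inside the generated Pydantic models):

```python
# Value bounds per integer vector datatype
INT_RANGES = {
    "INT8": (-128, 127),
    "INT16": (-32768, 32767),
}

def check_int_vector(values: list, dims: int, dtype: str) -> bool:
    """Return True if values has the right length and fits the dtype range."""
    if len(values) != dims:
        return False
    lo, hi = INT_RANGES[dtype]
    return all(isinstance(v, int) and lo <= v <= hi for v in values)
```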
## Nested JSON Validation

The validation system supports extracting and validating fields from nested JSON structures:
```python
# Schema with JSON paths
fields = {
    "id": Field(name="id", type=FieldTypes.TAG),
    "title": Field(name="title", type=FieldTypes.TEXT, path="$.content.title"),
    "rating": Field(name="rating", type=FieldTypes.NUMERIC, path="$.metadata.rating"),
}

# Nested JSON data
data = {
    "id": "doc1",
    "content": {
        "title": "Hello World"
    },
    "metadata": {
        "rating": 4.5
    }
}

# Validation extracts fields using JSON paths
validated = validate_object(schema, data)
# Result: {"id": "doc1", "title": "Hello World", "rating": 4.5}
```
## Error Handling

The validation system uses a fail-fast approach, raising a `ValueError` when validation fails:

```python
try:
    validated = validate_object(schema, data)
except ValueError as e:
    print(f"Validation error: {e}")
    # Handle the error
```

The error message includes information about the field that failed validation.
## Optional Fields

All fields are considered optional during validation. If a field is missing, it will be excluded from the validated result:

```python
# Schema with multiple fields
fields = {
    "id": Field(name="id", type=FieldTypes.TAG),
    "title": Field(name="title", type=FieldTypes.TEXT),
    "rating": Field(name="rating", type=FieldTypes.NUMERIC),
}

# Data with missing fields
data = {
    "id": "doc1",
    "title": "Hello World"
    # rating is missing
}

# Validation succeeds with partial data
validated = validate_object(schema, data)
# Result: {"id": "doc1", "title": "Hello World"}
```
## Performance Considerations

The validation system is optimized for performance:

- **Model Caching**: Pydantic models are cached by schema name to avoid regeneration
- **Lazy Validation**: Fields are validated only when needed
- **Fail-Fast Approach**: Processing stops at the first validation error

For large datasets, validation can be a significant part of the processing time. If you need to write many objects with the same structure, consider validating a sample first to ensure correctness.
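One way to apply that sample-first advice is sketched below with a stand-in validator. Here `validate_sample` and `fake_validate` are hypothetical illustrations; in practice you would pass a callable that invokes `validate_object` with your schema:

```python
from itertools import islice

def validate_sample(objects, validate, sample_size=10):
    """Validate only the first few objects to catch schema mismatches early.

    `validate` stands in for redisvl's validate_object bound to a schema;
    it is expected to raise ValueError on bad data (fail-fast).
    """
    for obj in islice(objects, sample_size):
        validate(obj)

# Illustrative stand-in validator: require a string "title" field
def fake_validate(obj):
    if not isinstance(obj.get("title"), str):
        raise ValueError(f"title must be a string, got {obj.get('title')!r}")

validate_sample([{"title": "a"}, {"title": "b"}], fake_validate)  # passes silently
```

If the sample validates cleanly, the remaining objects with the same structure can be written without re-checking each one up front.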
## Limitations

- **JSON Path**: The current implementation only supports simple dot notation paths (e.g., `$.field.subfield`). Array indexing is not supported.
- **Vector Bytes**: When vectors are provided as bytes, the dimensions cannot be validated.
- **Custom Validators**: The current implementation does not support custom user-defined validators.
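For reference, dot-notation resolution of the supported shape can be sketched as follows (`resolve_path` is a hypothetical illustration, not RedisVL's implementation):

```python
def resolve_path(data: dict, path: str):
    """Resolve a simple '$.a.b' JSON path; return None if any key is missing.

    Array indexing (e.g. '$.items[0]') is deliberately unsupported,
    mirroring the limitation described above.
    """
    current = data
    for key in path.lstrip("$.").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current
```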
## Best Practices

1. **Define Clear Schemas**: Be explicit about field types and constraints
2. **Pre-validate Critical Data**: For large datasets, validate a sample before processing everything
3. **Handle Validation Errors**: Implement proper error handling for validation failures
4. **Use JSON Paths Carefully**: Test nested JSON extraction to ensure paths are correctly defined
5. **Consider Optional Fields**: Decide which fields are truly required for your application
## Integration with Storage Classes

The validation system is fully integrated with the storage classes:

- **BaseStorage**: For hash-based storage, validates each field individually
- **JsonStorage**: For JSON storage, extracts and validates fields from nested structures

Each storage class automatically validates data before writing to Redis, ensuring data integrity.

redisvl/index/index.py

Lines changed: 23 additions & 16 deletions
```diff
@@ -118,8 +118,7 @@ def __init__(*args, **kwargs):
     def _storage(self) -> BaseStorage:
         """The storage type for the index schema."""
         return self._STORAGE_MAP[self.schema.index.storage_type](
-            prefix=self.schema.index.prefix,
-            key_separator=self.schema.index.key_separator,
+            index_schema=self.schema
         )

     @property
@@ -251,6 +250,7 @@ def __init__(
         redis_client: Optional[redis.Redis] = None,
         redis_url: Optional[str] = None,
         connection_kwargs: Optional[Dict[str, Any]] = None,
+        validate_on_load: bool = False,
         **kwargs,
     ):
         """Initialize the RedisVL search index with a schema, Redis client
@@ -265,6 +265,8 @@ def __init__(
                 connect to.
             connection_kwargs (Dict[str, Any], optional): Redis client connection
                 args.
+            validate_on_load (bool, optional): Whether to validate data against schema
+                when loading. Defaults to False.
         """
         if "connection_args" in kwargs:
             connection_kwargs = kwargs.pop("connection_args")
@@ -273,15 +275,14 @@ def __init__(
             raise ValueError("Must provide a valid IndexSchema object")

         self.schema = schema
-
+        self._validate_on_load = validate_on_load
         self._lib_name: Optional[str] = kwargs.pop("lib_name", None)

         # Store connection parameters
         self.__redis_client = redis_client
         self._redis_url = redis_url
         self._connection_kwargs = connection_kwargs or {}
         self._lock = threading.Lock()
-
         self._owns_redis_client = redis_client is None
         if self._owns_redis_client:
             weakref.finalize(self, self.disconnect)
@@ -574,7 +575,7 @@ def load(

         Raises:
             ValueError: If the length of provided keys does not match the length
-                of objects.
+                of objects or if validation fails when validate_on_load is enabled.

         .. code-block:: python
@@ -604,6 +605,7 @@ def add_field(d):
                 ttl=ttl,
                 preprocess=preprocess,
                 batch_size=batch_size,
+                validate=self._validate_on_load,
             )
         except:
             logger.exception("Error while loading data to Redis")
@@ -828,6 +830,7 @@ def __init__(
         redis_url: Optional[str] = None,
         redis_client: Optional[aredis.Redis] = None,
         connection_kwargs: Optional[Dict[str, Any]] = None,
+        validate_on_load: bool = False,
         **kwargs,
     ):
         """Initialize the RedisVL async search index with a schema.
@@ -840,6 +843,8 @@ def __init__(
                 instantiated redis client.
             connection_kwargs (Optional[Dict[str, Any]]): Redis client connection
                 args.
+            validate_on_load (bool, optional): Whether to validate data against schema
+                when loading. Defaults to False.
         """
         if "redis_kwargs" in kwargs:
             connection_kwargs = kwargs.pop("redis_kwargs")
@@ -849,15 +854,14 @@ def __init__(
             raise ValueError("Must provide a valid IndexSchema object")

         self.schema = schema
-
+        self._validate_on_load = validate_on_load
         self._lib_name: Optional[str] = kwargs.pop("lib_name", None)

         # Store connection parameters
         self._redis_client = redis_client
         self._redis_url = redis_url
         self._connection_kwargs = connection_kwargs or {}
         self._lock = asyncio.Lock()
-
         self._owns_redis_client = redis_client is None
         if self._owns_redis_client:
             weakref.finalize(self, sync_wrapper(self.disconnect))
@@ -1093,6 +1097,7 @@ async def expire_keys(
         else:
             return await client.expire(keys, ttl)

+    @deprecated_argument("concurrency", "Use batch_size instead.")
     async def load(
         self,
         data: Iterable[Any],
@@ -1101,9 +1106,10 @@ async def load(
         ttl: Optional[int] = None,
         preprocess: Optional[Callable] = None,
         concurrency: Optional[int] = None,
+        batch_size: Optional[int] = None,
     ) -> List[str]:
-        """Asynchronously load objects to Redis with concurrency control.
-        Returns the list of keys loaded to Redis.
+        """Asynchronously load objects to Redis. Returns the list of keys loaded
+        to Redis.

         RedisVL automatically handles constructing the object keys, batching,
         optional preprocessing steps, and setting optional expiration
@@ -1118,18 +1124,18 @@ async def load(
                 Must match the length of objects if provided. Defaults to None.
             ttl (Optional[int], optional): Time-to-live in seconds for each key.
                 Defaults to None.
-            preprocess (Optional[Callable], optional): An async function to
+            preprocess (Optional[Callable], optional): A function to
                 preprocess objects before storage. Defaults to None.
-            concurrency (Optional[int], optional): The maximum number of
-                concurrent write operations. Defaults to class's default
-                concurrency level.
+            batch_size (Optional[int], optional): Number of objects to write in
+                a single Redis pipeline execution. Defaults to class's
+                default batch size.

         Returns:
             List[str]: List of keys loaded to Redis.

         Raises:
             ValueError: If the length of provided keys does not match the
-                length of objects.
+                length of objects or if validation fails when validate_on_load is enabled.

         .. code-block:: python
@@ -1145,7 +1151,7 @@ async def add_field(d):
             keys = await index.load(data, keys=["rvl:foo", "rvl:bar"])

             # load data with preprocessing step
-            async def add_field(d):
+            def add_field(d):
                 d["new_field"] = 123
                 return d
             keys = await index.load(data, preprocess=add_field)
@@ -1160,7 +1166,8 @@ async def add_field(d):
                 keys=keys,
                 ttl=ttl,
                 preprocess=preprocess,
-                concurrency=concurrency,
+                batch_size=batch_size,
+                validate=self._validate_on_load,
             )
         except:
             logger.exception("Error while loading data to Redis")
```
