Commit 23ae35e

feat(metadata): add filtering and counting by metadata
- Implement `filter_by_metadata()` for efficient key-value filtering using JSON_EXTRACT.
- Add `count_by_metadata()` to count records matching metadata filters.
- Introduce `similarity_search_with_filter()` for combined similarity and metadata filtering.
- Enhance validation for metadata filters to prevent SQL injection and ensure valid keys.
- Update documentation and examples for new metadata filtering features.
- Add comprehensive tests for metadata filtering functionality.
1 parent 7035599 commit 23ae35e

13 files changed: +612 -20 lines changed

CHANGELOG.md

Lines changed: 27 additions & 0 deletions
@@ -5,6 +5,30 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.2.0] - 2025-02-01
+
+### Added
+- **Metadata filtering**: New `filter_by_metadata()` method for efficient key-value filtering using SQLite's JSON_EXTRACT
+- **Metadata counting**: New `count_by_metadata()` method to count records matching metadata filters
+- **Combined search**: New `similarity_search_with_filter()` method combining vector similarity with metadata filtering
+- Support for nested JSON paths in metadata filters (e.g., `{"author.name": "Alice"}`)
+- Pagination support in `filter_by_metadata()` with `limit` and `offset` parameters
+- New validation function `validate_metadata_filters()` for secure filter validation
+- New utility function `build_metadata_where_clause()` for safe SQL generation
+- Comprehensive test coverage for metadata filtering (unit, integration, and security tests)
+- New example `advanced_metadata_queries.py` demonstrating nested paths and complex queries
+- Updated `metadata_filtering.py` example with new filtering methods
+
+### Security
+- SQL injection prevention in metadata filter keys
+- Validation of JSON paths to prevent malicious queries
+- Parameterized queries for all metadata filtering operations
+
+### Documentation
+- Added "Metadata Filtering" section to README with examples
+- Updated examples list in README
+- Added comprehensive docstrings for new methods
+
 ## [2.1.1] - 2025-01-31
 
 ### Changed
@@ -166,6 +190,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## Version History
 
+- **2.2.0** - Added metadata filtering with JSON_EXTRACT support
+- **2.1.1** - Moved table name validation to create_table()
+- **2.1.0** - Added connection pooling support
 - **2.0.0** - Major refactor: simplified API, removed niche methods, cleaner naming
 - **1.2.0** - Added benchmarks module
 - **1.0.0** - First stable release
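
The changelog entries above describe key validation, nested JSON paths, and parameterized queries, but this page does not show the library code behind them. The following is a minimal sketch of how such a filter could be built with the standard `sqlite3` module. The helper name `build_where_sketch`, the key pattern, and the `docs` table are assumptions made for illustration; they are not the signatures of `validate_metadata_filters()` or `build_metadata_where_clause()`.

```python
import json
import re
import sqlite3

# Assumed rule (hypothetical): keys are dot-separated identifiers, e.g. "author.name".
_KEY_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_where_sketch(filters):
    """Return (where_sql, params) for a dict of metadata filters."""
    if not filters:
        raise ValueError("filters must be a non-empty dict")
    clauses, params = [], []
    for key, value in filters.items():
        # Reject anything that is not a plain dotted identifier path.
        if not isinstance(key, str) or not _KEY_PATTERN.fullmatch(key):
            raise ValueError(f"invalid metadata key: {key!r}")
        # Both the JSON path and the value are bound as parameters,
        # so no user input is interpolated into the SQL text.
        clauses.append("json_extract(metadata, ?) = ?")
        params.extend([f"$.{key}", value])
    return " AND ".join(clauses), params


# Plain table standing in for the real storage (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (text TEXT, metadata TEXT)")
rows = [
    ("Intro to Python", {"author": {"name": "Alice"}, "year": 2024}),
    ("Advanced JavaScript", {"author": {"name": "Bob"}, "year": 2024}),
    ("Python for Data Science", {"author": {"name": "Alice"}, "year": 2023}),
]
conn.executemany(
    "INSERT INTO docs (text, metadata) VALUES (?, ?)",
    [(text, json.dumps(meta)) for text, meta in rows],
)

where, params = build_where_sketch({"author.name": "Alice", "year": 2024})
matches = [
    row[0]
    for row in conn.execute(
        f"SELECT text FROM docs WHERE {where} ORDER BY rowid", params
    )
]
print(matches)
```

Only the fixed clause template is ever concatenated into the statement, which is one way to get the "parameterized queries for all metadata filtering operations" property listed under Security.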

README.md

Lines changed: 36 additions & 0 deletions
@@ -66,6 +66,40 @@ rows = client.get_many(rowids)
 client.close()
 ```
 
+## Metadata Filtering
+
+Efficiently filter records by metadata fields using SQLite's JSON functions:
+
+```python
+# Filter by single field
+results = client.filter_by_metadata({"category": "python"})
+
+# Filter by multiple fields
+results = client.filter_by_metadata({"category": "python", "year": 2024})
+
+# Nested JSON paths
+results = client.filter_by_metadata({"author.name": "Alice"})
+
+# Count matching records
+count = client.count_by_metadata({"category": "python"})
+
+# Combined similarity search + metadata filtering
+hits = client.similarity_search_with_filter(
+    embedding=query_vector,
+    filters={"category": "python"},
+    top_k=5
+)
+
+# Pagination
+results = client.filter_by_metadata(
+    {"category": "python"},
+    limit=10,
+    offset=0
+)
+```
+
+See [examples/metadata_filtering.py](examples/metadata_filtering.py) and [examples/advanced_metadata_queries.py](examples/advanced_metadata_queries.py) for more examples.
+
 ## Bulk Operations
 
 The client provides optimized methods for bulk operations:
@@ -219,6 +253,8 @@ Edit [benchmarks/config.yaml](benchmarks/config.yaml) to customize:
 - [TESTING.md](TESTING.md) - Testing documentation
 - [Examples](examples/) - Usage examples
   - [basic_usage.py](examples/basic_usage.py) - Basic CRUD operations
+  - [metadata_filtering.py](examples/metadata_filtering.py) - Metadata filtering and queries
+  - [advanced_metadata_queries.py](examples/advanced_metadata_queries.py) - Advanced metadata filtering with nested paths
   - [transaction_example.py](examples/transaction_example.py) - Transaction management with all CRUD operations
   - [batch_operations.py](examples/batch_operations.py) - Bulk operations
   - [logging_example.py](examples/logging_example.py) - Logging configuration
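
The README section above shows how `similarity_search_with_filter()` is called but not how similarity ranking and metadata filtering interact. One plausible strategy, sketched below in pure Python, is to rank by distance, keep the `top_k` nearest candidates, and only then apply the filters. This is an assumption for illustration, not the library's confirmed behaviour; `search_with_filter_sketch`, `get_path`, and the record layout are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def get_path(meta, dotted_key):
    """Resolve a dotted key such as "author.name" inside nested dicts."""
    current = meta
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def search_with_filter_sketch(records, query, filters, top_k):
    """Take the top_k nearest records, then drop those failing the filters."""
    ranked = sorted(records, key=lambda r: cosine_distance(r["embedding"], query))
    candidates = ranked[:top_k]
    return [
        (r["text"], cosine_distance(r["embedding"], query))
        for r in candidates
        if all(get_path(r["metadata"], k) == v for k, v in filters.items())
    ]


records = [
    {"text": "Intro to Python", "embedding": [1.0, 0.0],
     "metadata": {"category": "python"}},
    {"text": "Advanced JavaScript", "embedding": [0.7, 0.7],
     "metadata": {"category": "javascript"}},
    {"text": "Python for Data Science", "embedding": [0.0, 1.0],
     "metadata": {"category": "python"}},
]
query = [1.0, 0.05]

narrow = search_with_filter_sketch(records, query, {"category": "python"}, top_k=2)
wide = search_with_filter_sketch(records, query, {"category": "python"}, top_k=3)
print([text for text, _ in narrow])
print([text for text, _ in wide])
```

Under this post-filtering strategy a call can return fewer than `top_k` hits, because matching records outside the candidate window are never considered. Widening `top_k` is the simplest remedy in the sketch.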

TODO

Lines changed: 2 additions & 2 deletions
@@ -69,8 +69,8 @@
 - [x] Benchmark tests
 
 ### New Features
-- [ ] Partial search on JSON metadata (JSON_EXTRACT)
-- [ ] Metadata field filtering (key-value based)
+- [x] Partial search on JSON metadata (JSON_EXTRACT)
+- [x] Metadata field filtering (key-value based)
 - [x] Transaction context manager
 - [ ] Async/await support (aiosqlite)
 - [ ] Export/import functions (JSON, CSV)

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+"""Advanced metadata querying examples for sqlite-vec-client.
+
+Demonstrates:
+- Nested JSON path queries
+- Complex filtering scenarios
+- Performance comparison: filter vs manual iteration
+- Real-world use cases
+"""
+
+from sqlite_vec_client import SQLiteVecClient
+
+
+def main():
+    client = SQLiteVecClient(table="documents", db_path=":memory:")
+    client.create_table(dim=128, distance="cosine")
+
+    # Add documents with nested metadata
+    texts = [
+        "Introduction to Python",
+        "Advanced JavaScript",
+        "Python for Data Science",
+        "Java Programming Guide",
+        "Machine Learning with Python",
+    ]
+
+    embeddings = [[0.1 * i] * 128 for i in range(len(texts))]
+
+    metadata = [
+        {"author": {"name": "Alice", "country": "US"}, "tags": ["python", "beginner"]},
+        {
+            "author": {"name": "Bob", "country": "UK"},
+            "tags": ["javascript", "advanced"],
+        },
+        {"author": {"name": "Alice", "country": "US"}, "tags": ["python", "data"]},
+        {"author": {"name": "Charlie", "country": "CA"}, "tags": ["java", "beginner"]},
+        {"author": {"name": "Alice", "country": "US"}, "tags": ["python", "ml"]},
+    ]
+
+    rowids = client.add(texts=texts, embeddings=embeddings, metadata=metadata)
+    print(f"Added {len(rowids)} documents\n")
+
+    # Example 1: Nested JSON path queries
+    print("=== Nested JSON Path Queries ===")
+    print("\nDocuments by Alice (nested path):")
+    results = client.filter_by_metadata({"author.name": "Alice"})
+    for rowid, text, meta, _ in results:
+        print(f" [{rowid}] {text}")
+    print(f"Total: {len(results)} documents")
+
+    # Example 2: Filter by country
+    print("\n\nDocuments from US authors:")
+    results = client.filter_by_metadata({"author.country": "US"})
+    for rowid, text, meta, _ in results:
+        print(f" [{rowid}] {text} by {meta['author']['name']}")
+
+    # Example 3: Count by author
+    print("\n\n=== Count by Author ===")
+    for author in ["Alice", "Bob", "Charlie"]:
+        count = client.count_by_metadata({"author.name": author})
+        print(f" {author}: {count} documents")
+
+    # Example 4: Combined similarity + metadata filtering
+    print("\n\n=== Combined Similarity + Metadata Filtering ===")
+    query_emb = [0.15] * 128
+    print("\nSimilar documents by Alice (top 10 candidates, filtered):")
+    hits = client.similarity_search_with_filter(
+        embedding=query_emb, filters={"author.name": "Alice"}, top_k=10
+    )
+    if hits:
+        for rowid, text, distance in hits:
+            dist_str = f"{distance:.4f}" if distance is not None else "N/A"
+            print(f" [{rowid}] {text} (distance: {dist_str})")
+    else:
+        print(" No results found (filters may be too restrictive)")
+
+    # Example 5: Pagination for large result sets
+    print("\n\n=== Pagination Example ===")
+    all_us_docs = client.count_by_metadata({"author.country": "US"})
+    print(f"Total US documents: {all_us_docs}")
+    print("\nFetching in pages of 2:")
+    for page in range(0, all_us_docs, 2):
+        results = client.filter_by_metadata(
+            {"author.country": "US"}, limit=2, offset=page
+        )
+        print(f" Page {page//2 + 1}: {[r[1] for r in results]}")
+
+    # Example 6: Alternative - regular similarity search
+    print("\n\n=== Regular Similarity Search (no filter) ===")
+    hits = client.similarity_search(embedding=query_emb, top_k=3)
+    print("Top 3 similar documents:")
+    for rowid, text, distance in hits:
+        if distance is not None:
+            print(f" [{rowid}] {text} (distance: {distance:.4f})")
+        else:
+            print(f" [{rowid}] {text}")
+
+    # Performance note
+    print("\n\n=== Performance Note ===")
+    print("filter_by_metadata() uses SQLite's json_extract() for efficient queries.")
+    print("This is much faster than manually iterating with get_all().")
+    print("\nFor frequently queried fields, consider:")
+    print(" 1. Using filter_by_metadata() for ad-hoc queries")
+    print(" 2. Creating computed columns for indexed queries (advanced)")
+
+    client.close()
+
+
+if __name__ == "__main__":
+    main()
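
The performance note at the end of this example mentions computed columns for indexed queries without showing one. The sketch below illustrates the idea with the standard `sqlite3` module: a generated column exposes a JSON path as an ordinary column, and an index on it lets SQLite avoid a full scan. The `docs` table, the `author_name` column, and the index name are assumptions for illustration and say nothing about the schema sqlite-vec-client creates. Generated columns require SQLite 3.31 or newer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# A VIRTUAL generated column is computed from the JSON text on read;
# the index below stores its values so lookups do not re-parse every row.
conn.execute(
    """
    CREATE TABLE docs (
        text TEXT,
        metadata TEXT,
        author_name TEXT GENERATED ALWAYS AS
            (json_extract(metadata, '$.author.name')) VIRTUAL
    )
    """
)
conn.execute("CREATE INDEX idx_docs_author_name ON docs(author_name)")

rows = [
    ("Introduction to Python", {"author": {"name": "Alice"}}),
    ("Advanced JavaScript", {"author": {"name": "Bob"}}),
    ("Python for Data Science", {"author": {"name": "Alice"}}),
]
conn.executemany(
    "INSERT INTO docs (text, metadata) VALUES (?, ?)",
    [(text, json.dumps(meta)) for text, meta in rows],
)

# Ask SQLite how it would run the lookup; the fourth field is the plan detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT text FROM docs WHERE author_name = ?",
    ("Alice",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)

alice_docs = [
    row[0]
    for row in conn.execute(
        "SELECT text FROM docs WHERE author_name = ? ORDER BY rowid", ("Alice",)
    )
]
print(plan_text)
print(alice_docs)
```

The trade-off is the usual one for indexes: faster reads on the chosen field in exchange for extra storage and slower writes, which is why it suits only frequently queried fields.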

examples/metadata_filtering.py

Lines changed: 37 additions & 15 deletions
@@ -2,7 +2,9 @@
 
 Demonstrates:
 - Adding records with metadata
-- Querying records with get_all
+- Filtering by metadata fields
+- Counting records by metadata
+- Combined similarity search with metadata filtering
 - Updating metadata
 """
 
@@ -35,19 +37,32 @@ def main():
     rowids = client.add(texts=texts, embeddings=embeddings, metadata=metadata)
     print(f"Added {len(rowids)} articles")
 
-    # Query all articles and filter by author
-    print("\nAlice's articles:")
-    for rowid, text, meta, _ in client.get_all():
-        if meta.get("author") == "Alice":
-            print(f" [{rowid}] {text} - {meta}")
+    # Filter by metadata - efficient JSON_EXTRACT queries
+    print("\nAlice's articles (using filter_by_metadata):")
+    results = client.filter_by_metadata({"author": "Alice"})
+    for rowid, text, meta, _ in results:
+        print(f" [{rowid}] {text} - {meta}")
 
-    # Query all articles and filter by text
-    print("\nPython-related articles:")
-    for rowid, text, meta, _ in client.get_all():
-        if "Python" in text:
-            print(f" [{rowid}] {text}")
+    # Filter by multiple fields
+    print("\nArticles from 2024:")
+    results = client.filter_by_metadata({"year": 2024})
+    for rowid, text, meta, _ in results:
+        print(f" [{rowid}] {text} - Year: {meta['year']}")
 
-    # Update metadata
+    # Count records by metadata
+    count = client.count_by_metadata({"author": "Alice"})
+    print(f"\nTotal articles by Alice: {count}")
+
+    # Combined similarity search with metadata filtering
+    print("\nSimilar to 'Python' in category 'programming':")
+    query_emb = [0.1] * 128
+    hits = client.similarity_search_with_filter(
+        embedding=query_emb, filters={"category": "programming"}, top_k=5
+    )
+    for rowid, text, distance in hits:
+        print(f" [{rowid}] {text} (distance: {distance:.4f})")
+
+    # Update metadata and verify with filter
     if rowids:
         client.update(
             rowids[0],
@@ -58,9 +73,16 @@ def main():
                 "updated": True,
             },
         )
-        updated = client.get(rowids[0])
-        if updated:
-            print(f"\nUpdated metadata: {updated[2]}")
+        # Find updated records
+        updated_records = client.filter_by_metadata({"updated": True})
+        print(f"\nUpdated records: {len(updated_records)}")
+        if updated_records:
+            print(f" Metadata: {updated_records[0][2]}")
+
+    # Pagination example
+    print("\nPagination example (limit=2):")
+    page1 = client.filter_by_metadata({"year": 2024}, limit=2, offset=0)
+    print(f" Page 1: {len(page1)} results")
 
     client.close()
 
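
The counting and pagination calls in this example map naturally onto `COUNT(*)` and `LIMIT`/`OFFSET` in SQL. The sketch below shows that mapping with the standard `sqlite3` module. The `docs` table and the queries are assumptions for illustration, not the statements that `count_by_metadata()` or `filter_by_metadata()` actually issue.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (text TEXT, metadata TEXT)")

titles = ["Doc A", "Doc B", "Doc C", "Doc D", "Doc E"]
years = [2024, 2023, 2024, 2024, 2024]
conn.executemany(
    "INSERT INTO docs (text, metadata) VALUES (?, ?)",
    [(title, json.dumps({"year": year})) for title, year in zip(titles, years)],
)

path, value = "$.year", 2024

# Count first, so the caller knows how many pages to request.
total = conn.execute(
    "SELECT COUNT(*) FROM docs WHERE json_extract(metadata, ?) = ?",
    (path, value),
).fetchone()[0]

page_size = 2
pages = []
for offset in range(0, total, page_size):
    rows = conn.execute(
        "SELECT text FROM docs WHERE json_extract(metadata, ?) = ? "
        "ORDER BY rowid LIMIT ? OFFSET ?",
        (path, value, page_size, offset),
    ).fetchall()
    pages.append([row[0] for row in rows])

print(total)
print(pages)
```

The explicit `ORDER BY rowid` matters here: without a stable ordering, SQLite makes no guarantee that consecutive pages are disjoint or complete.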

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sqlite-vec-client"
-version = "2.1.1"
+version = "2.2.0"
 description = "A lightweight Python client around sqlite-vec for CRUD and similarity search."
 readme = "README.md"
 requires-python = ">=3.9"
