Commit 16efb94

planetf1 and ajbozarth authored
feature: add Chunking support (#37)
* Add Chunking support
  Signed-off-by: Nigel Jones <[email protected]>
* Ruff format after chunking
  Signed-off-by: Nigel Jones <[email protected]>
* Chunking refactor
  Signed-off-by: Nigel Jones <[email protected]>
* Chunking - clean up search
  Signed-off-by: Nigel Jones <[email protected]>
* Chunking - further cleanup
  Signed-off-by: Nigel Jones <[email protected]>
* Chunking - ruff
  Signed-off-by: Nigel Jones <[email protected]>
* chunking : add clarity to sentence splitter
  Signed-off-by: Nigel Jones <[email protected]>
* chunking : add more unit tests and fix bugs
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: refactor tests into separate files by strategy
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: ruff reformat
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: tidy readme
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: weaviate updates to resync
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: improve consistency of search result scores between backends
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: clean up search results & document
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: review fixes
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: remove weaviate test file
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: fix issue with missing collection on interactive commands
  Signed-off-by: Nigel Jones <[email protected]>
* chunking: further review comments
  Signed-off-by: Nigel Jones <[email protected]>
* Update README.md
* Update cli/README.md
* Apply suggestions from code review

---------

Signed-off-by: Nigel Jones <[email protected]>
Co-authored-by: Alex Bozarth <[email protected]>
1 parent c326422 commit 16efb94

34 files changed: +4344 -1365 lines changed

README.md

Lines changed: 36 additions & 1 deletion
@@ -6,6 +6,7 @@ A modular vector database interface supporting multiple backends (Weaviate, Milv
 
 - **Multi-backend support**: Weaviate and Milvus vector databases
 - **Flexible embedding strategies**: Support for pre-computed vectors and multiple embedding models
+- **Pluggable document chunking**: None (default), Fixed (size/overlap), Sentence-aware
 - **Unified API**: Consistent interface across different vector database implementations
 - **Factory pattern**: Easy creation and switching between database types
 - **MCP Server**: Model Context Protocol server for AI agent integration with multi-database support
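
The "Fixed (size/overlap)" strategy named in the added bullet can be pictured as a sliding window over the document text. The following is a minimal sketch under stated assumptions, not the project's implementation: the function name `fixed_chunks`, its parameters, character-based sizing, and the exclusive `offset_end` are all hypothetical. The metadata keys mirror the field names this commit documents for search results in README.md.

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 0) -> list[dict]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper for illustration; not taken from the chunking package.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {
                "text": text[start:end],
                "offset_start": start,
                "offset_end": end,  # assumption: end offset is exclusive
                "chunk_size": end - start,
            }
        )
        if end == len(text):
            break  # the last window reached the end of the text
        start += step

    # Sequence numbers are 1-based, matching the documented schema.
    for number, chunk in enumerate(chunks, start=1):
        chunk["chunk_sequence_number"] = number
        chunk["total_chunks"] = len(chunks)
    return chunks


if __name__ == "__main__":
    for chunk in fixed_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(chunk["chunk_sequence_number"], chunk["text"])
```

With a 10-character input, `chunk_size=4`, and `overlap=1`, the window advances three characters at a time, so each chunk repeats the last character of the previous one.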
@@ -154,6 +155,7 @@ echo "Your document content here" > my_doc.txt
 The project includes a Go-based CLI tool for managing vector databases through the MCP server. For comprehensive CLI usage, installation, and examples, see [cli/README.md](cli/README.md).
 
 **Quick CLI Examples:**
+
 ```bash
 # Build and use the CLI
 cd cli && go build -o maestro-k src/*.go
@@ -166,6 +168,9 @@ cd cli && go build -o maestro-k src/*.go
 
 # Query documents
 ./maestro-k query "What is the main topic?" --vdb=my-database
+
+# Resync any Milvus collections into the MCP server's in-memory registry (use after server restart)
+./maestro-k resync-databases
 ```
 
 ### MCP Server
@@ -182,8 +187,35 @@ The project includes a Model Context Protocol (MCP) server that exposes vector d
 
 # Check server status
 ./stop.sh status
+
+# Manual resync tool (available as an MCP tool and through the CLI `resync-databases` command):
+# After restarting the MCP server, run the resync to register existing Milvus collections:
+./maestro-k resync-databases
 ```
 
+### Search and Query Output
+
+- Search returns JSON results suitable for programmatic use.
+- Query returns a human-readable text summary (no JSON flag).
+
+Search result schema (normalized across Weaviate and Milvus):
+
+- id: unique chunk identifier
+- url: source URL or file path
+- text: chunk text
+- metadata:
+  - doc_name: original document name/slug
+  - chunk_sequence_number: 1-based chunk index within the document
+  - total_chunks: total chunks for the document
+  - offset_start / offset_end: character offsets in the original text
+  - chunk_size: size of the chunk in characters
+- similarity: canonical relevance score in [0..1]
+- distance: cosine distance (approximately 1 − similarity); included for convenience
+- rank: 1-based rank in the current result set
+- _metric: similarity metric name (e.g., "cosine")
+- _search_mode: "vector" (vector similarity) or "keyword" (fallback)
+
+
 ## Embedding Strategies
 
 The library supports flexible embedding strategies for both vector databases. For detailed embedding model support and usage examples, see [src/maestro_mcp/README.md](src/maestro_mcp/README.md).
@@ -194,14 +226,15 @@ The library supports flexible embedding strategies for both vector databases. Fo
 - **Milvus**: Supports pre-computed vectors and OpenAI embedding models
 - **Environment Variables**: Set `OPENAI_API_KEY` for OpenAI embedding models
 
-### Basic Usage
+### Embedding Usage
 
 ```python
 # Check supported embeddings
 supported = db.supported_embeddings()
 print(f"Supported embeddings: {supported}")
 
 # Write documents with specific embedding
+(Deprecated) Embedding is configured per collection. Any per-document embedding specified in writes is ignored.
 db.write_documents(documents, embedding="text-embedding-3-small")
 ```
 
@@ -306,6 +339,7 @@ maestro-knowledge/
 │ │ ├── server.py # Main MCP server
 │ │ ├── mcp_config.json # MCP client configuration
 │ │ └── README.md # MCP server documentation
+│ ├── chunking/ # Pluggable document chunking package
 │ └── vector_db.py # Main module exports
 ├── cli/ # Go CLI tool
 │ ├── src/ # Go source code
@@ -404,6 +438,7 @@ The project includes comprehensive log monitoring capabilities:
 ```
 
 **Log Monitoring Features:**
+
 - **📡 Real-time tailing** - Monitor logs as they're generated
 - **✅ Visual status indicators** - Clear service status with checkmarks and X marks
 - **🌐 Port monitoring** - Check service availability on ports
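
The search result schema added to README.md in this commit describes `similarity` as a score in [0..1], `distance` as roughly 1 − similarity, and `rank` as 1-based. The sketch below shows one way a backend's raw cosine distances could be mapped onto that shape. It is an illustration under stated assumptions, not the project's code: the function name `normalize_results`, the input layout of `raw_hits`, and the clamping of similarity to [0, 1] are all hypothetical.

```python
def normalize_results(raw_hits: list[dict], search_mode: str = "vector") -> list[dict]:
    """Map raw hits carrying a cosine distance onto the documented result schema.

    Hypothetical helper for illustration; each raw hit is assumed to be a dict
    with at least "id" and "distance" keys.
    """
    results = []
    for hit in raw_hits:
        distance = float(hit["distance"])
        # Assumption: clamp so the score stays within [0, 1] even when the
        # cosine distance exceeds 1 (it can range up to 2).
        similarity = max(0.0, min(1.0, 1.0 - distance))
        results.append(
            {
                "id": hit["id"],
                "url": hit.get("url", ""),
                "text": hit.get("text", ""),
                "metadata": hit.get("metadata", {}),
                "similarity": similarity,
                "distance": distance,
                "_metric": "cosine",
                "_search_mode": search_mode,
            }
        )

    # Most similar first, then assign 1-based ranks within this result set.
    results.sort(key=lambda r: r["similarity"], reverse=True)
    for rank, result in enumerate(results, start=1):
        result["rank"] = rank
    return results


if __name__ == "__main__":
    hits = [{"id": "a", "distance": 0.25}, {"id": "b", "distance": 0.1}]
    for result in normalize_results(hits):
        print(result["rank"], result["id"], result["similarity"])
```

Keeping both `similarity` and `distance` in each result lets callers that already think in distances keep doing so, while ranking is driven by the single canonical score.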
