Commit 0304e3a

Merge pull request #45 from AI4quantum/gliu/semantic-chunking-only
enhancement: integrate semantic chunking
2 parents d46e0c8 + e711ad7 commit 0304e3a

8 files changed: +675 -4 lines changed

README.md

Lines changed: 57 additions & 1 deletion
@@ -6,7 +6,7 @@ A modular vector database interface supporting multiple backends (Weaviate, Milv
 
 - **Multi-backend support**: Weaviate and Milvus vector databases
 - **Flexible embedding strategies**: Support for pre-computed vectors and multiple embedding models
-- **Pluggable document chunking**: None (default), Fixed (size/overlap), Sentence-aware
+- **Pluggable document chunking**: None (default), Fixed (size/overlap), Sentence-aware, Semantic (AI-powered)
 - **Unified API**: Consistent interface across different vector database implementations
 - **Factory pattern**: Easy creation and switching between database types
 - **MCP Server**: Model Context Protocol server for AI agent integration with multi-database support
@@ -18,6 +18,62 @@ A modular vector database interface supporting multiple backends (Weaviate, Milv
 - **Environment variable substitution**: Dynamic configuration with `{{ENV_VAR_NAME}}` syntax
 - **Safety features**: Confirmation prompts for destructive operations with `--force` flag bypass
 
+## Chunking Strategies
+
+Maestro Knowledge supports multiple document chunking strategies to optimize how your documents are split for vector search:
+
+### Available Strategies
+
+- **None**: No chunking performed (default)
+- **Fixed**: Split documents into fixed-size chunks with optional overlap
+- **Sentence**: Split documents at sentence boundaries with size limits
+- **Semantic**: Identifies semantic boundaries using sentence embeddings
+
+### Semantic Chunking
+
+The semantic chunking strategy uses sentence transformers to intelligently split documents:
+
+```python
+from src.chunking import ChunkingConfig, chunk_text
+
+# Configure semantic chunking
+config = ChunkingConfig(
+    strategy="Semantic",
+    parameters={
+        "chunk_size": 768,  # Default for semantic (vs 512 for others)
+        "overlap": 0,  # Optional overlap between chunks
+        "window_size": 1,  # Context window for similarity calculation
+        "threshold_percentile": 90.0,  # Percentile threshold for splits
+        "model_name": "all-MiniLM-L6-v2"  # Sentence transformer model
+    }
+)
+
+# Chunk your text
+chunks = chunk_text("Your document text here...", config)
+```
+
+**Key Benefits**:
+- Preserves semantic meaning across chunk boundaries
+- Automatically finds natural break points in text
+- Respects size limits while maintaining context
+- Uses 768 character default (optimal for semantic understanding)
+
+**Note**: Semantic chunking uses sentence-transformers for chunking decisions, but the resulting chunks are embedded using your collection's embedding model (e.g., nomic-embed-text) for search operations.
+
+### Testing Semantic Chunking
+
+You can test the semantic chunking functionality using the CLI:
+
+```bash
+# Check collection information to see chunking strategy
+cli/maestro-k collection info --vdb "Qiskit_studio_algo" --name "Qiskit_studio_algo"
+
+# Search with semantic chunking to see results
+./cli/maestro-k search "quantum circuit" --vdb qiskit_studio_algo --collection qiskit_studio_algo --doc-limit 1
+```
+
+**Note**: The semantic chunking strategy uses sentence-transformers for chunking decisions, while the collection's own embedding model is used for search operations.
+
 ## Quick Start
 
 ### Installation
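
The README changes above describe the semantic strategy as splitting text into sentences, comparing sentence embeddings, and breaking where the difference exceeds a percentile threshold while respecting a size cap. A minimal sketch of that idea follows. It is illustrative only: `src/chunking/semantic_chunking.py` is not shown in this view, so `toy_embed`, `cosine_distance`, `percentile`, and `semantic_split` are hypothetical names, and a bag-of-words count vector stands in for a sentence-transformers model so the example runs on the standard library alone. The `window_size` and `overlap` parameters are left out for brevity.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Assumed stand-in for a real sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_split(text, threshold_percentile=90.0, chunk_size=768):
    """Group sentences into chunks, breaking at the largest topic shifts."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences

    vectors = [toy_embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, threshold_percentile)

    chunks, current = [], sentences[0]
    for sentence, distance in zip(sentences[1:], distances):
        too_long = len(current) + 1 + len(sentence) > chunk_size
        if distance > threshold or too_long:
            # Start a new chunk at a topic shift or when the size cap is hit.
            chunks.append(current)
            current = sentence
        else:
            current += " " + sentence
    chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "Qubits store quantum state. Qubits carry quantum information. "
        "Bread needs flour and yeast. Bread dough needs flour and water."
    )
    for chunk in semantic_split(sample):
        print(repr(chunk))
```

With a real embedding model the distances would reflect meaning rather than shared words, but the thresholding and size-cap logic would keep the same shape.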

cli/README.md

Lines changed: 45 additions & 1 deletion
@@ -9,7 +9,7 @@ A command-line interface for interacting with the Maestro Knowledge MCP server w
 - **List collections**: List all collections in a specific vector database
 - **List documents**: List documents in a specific collection of a vector database
 - **Query documents**: Query documents using natural language with semantic search
-- **Pluggable document chunking**: Configure per-collection chunking (None, Fixed with size/overlap, Sentence)
+- **Pluggable document chunking**: Configure per-collection chunking (None, Fixed with size/overlap, Sentence, Semantic)
 - Discover supported strategies with `maestro-k chunking list`
 - **Create vector databases**: Create vector databases from YAML configuration files
 - **Delete vector databases**: Delete vector databases by name
@@ -317,6 +317,33 @@ Override the MCP server URI via command-line flag:
 ./maestro-k chunking list
 ```
 
+#### Chunking Strategies
+
+**None**: No chunking is performed (default)
+**Fixed**: Split documents into fixed-size chunks with optional overlap
+**Sentence**: Split documents at sentence boundaries with size limits
+**Semantic**: AI-powered chunking that identifies semantic boundaries using sentence embeddings
+
+#### Semantic Chunking Example
+
+Semantic chunking uses sentence transformers to identify natural break points in documents:
+
+```bash
+# Create collection with semantic chunking
+./maestro-k create collection my-database my-collection \
+  --chunking-strategy=Semantic \
+  --chunk-size=768 \
+  --chunk-overlap=0
+
+# The semantic strategy will:
+# - Split text into sentences
+# - Use AI embeddings to find semantic boundaries
+# - Respect the chunk_size limit while preserving meaning
+# - Default to 768 characters (vs 512 for other strategies)
+```
+
+**Note**: Semantic chunking uses sentence-transformers for chunking decisions, but the resulting chunks are embedded using your collection's embedding model (e.g., nomic-embed-text) for search operations.
+
 ### Environment Variable Substitution in YAML Files
 
 The CLI supports environment variable substitution in YAML files using the `{{ENV_VAR_NAME}}` syntax. This allows you to use environment variables directly in your configuration files:
@@ -986,3 +1013,20 @@ go test -v ./tests/...
 ## License
 
 Apache 2.0 License - see the main project LICENSE file for details.
+
+## Semantic Chunking Example
+
+The CLI supports semantic chunking for intelligent document splitting:
+
+```bash
+# Create a collection with semantic chunking
+cli/maestro-k collection create --vdb my-vdb --name my-collection
+
+# Check collection information to see chunking strategy
+cli/maestro-k collection info --vdb "Qiskit_studio_algo" --name "Qiskit_studio_algo"
+
+# Search with semantic chunking to see results
+./cli/maestro-k search "quantum circuit" --vdb qiskit_studio_algo --collection qiskit_studio_algo --doc-limit 1
+```
+
+**Note**: The semantic chunking strategy uses sentence-transformers for chunking decisions, while the collection's own embedding model is used for search operations.

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,9 @@ dependencies = [
     "jsonschema>=4.25.0",
     "fastmcp>=2.11.0",
     "six>=1.17.0",
+    "sentence-transformers>=2.5.1",
+    "scikit-learn>=1.5.0",
+    "numpy>=1.26.0",
 ]
 
 [tool.ruff.lint]
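
Since this hunk adds three runtime dependencies, a quick import-path check can confirm they are installed before exercising the Semantic strategy. The sketch below is an assumption-laden convenience, not part of the commit: `missing_modules` and `NEW_DEPENDENCY_MODULES` are hypothetical names, and the check only tests whether each module can be found, not whether its version satisfies the constraints in `pyproject.toml`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that cannot be found on the current import path."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names for the three packages added in the hunk above
# (sentence-transformers, scikit-learn, numpy).
NEW_DEPENDENCY_MODULES = ["sentence_transformers", "sklearn", "numpy"]

if __name__ == "__main__":
    missing = missing_modules(NEW_DEPENDENCY_MODULES)
    print("missing:", missing if missing else "none")
```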

src/chunking/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -10,11 +10,13 @@
 # Re-export strategy names for discovery if needed
 from .none import none_chunk
 from .sentence import sentence_chunk
+from .semantic_chunking import semantic_chunk
 
 __all__ = [
     "ChunkingConfig",
     "chunk_text",
     "none_chunk",
     "fixed_chunk",
     "sentence_chunk",
+    "semantic_chunk",
 ]

src/chunking/common.py

Lines changed: 4 additions & 2 deletions
@@ -47,8 +47,10 @@ def chunk_text(
 
     # apply defaults when strategy is set and parameters missing
     if strategy != "None":
-        # default chunk size 512 and overlap 0
-        params = {"chunk_size": 512, "overlap": 0}
+        if strategy == "Semantic":
+            params = {"chunk_size": 768, "overlap": 0}
+        else:
+            params = {"chunk_size": 512, "overlap": 0}
         params.update(parameters)
     else:
         params = {}
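
The default-resolution logic in this hunk can be restated as a standalone function to make its behavior easy to check. This is a sketch that mirrors only the lines shown above. `resolve_params` is a hypothetical name, and the surrounding `chunk_text` function (including how it obtains `strategy` and `parameters`) is not visible in this view.

```python
def resolve_params(strategy, parameters=None):
    """Merge caller-supplied parameters over per-strategy defaults."""
    if strategy == "None":
        # No chunking: parameters are ignored entirely.
        return {}
    # Semantic defaults to a larger chunk size than the other strategies.
    default_size = 768 if strategy == "Semantic" else 512
    params = {"chunk_size": default_size, "overlap": 0}
    # Explicit values always win over the defaults.
    params.update(parameters or {})
    return params


if __name__ == "__main__":
    print(resolve_params("Semantic"))
    print(resolve_params("Fixed", {"overlap": 64}))
    print(resolve_params("None", {"chunk_size": 10}))
```

Note that explicit parameters override the defaults key by key, so passing only `overlap` leaves the strategy's default `chunk_size` in place.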
