- **Environment variable substitution**: Dynamic configuration with `{{ENV_VAR_NAME}}` syntax
- **Safety features**: Confirmation prompts for destructive operations with `--force` flag bypass

## Chunking Strategies
Maestro Knowledge supports multiple document chunking strategies to optimize how your documents are split for vector search:
### Available Strategies
- **None**: No chunking performed (default)
- **Fixed**: Split documents into fixed-size chunks with optional overlap (see the sketch after this list)
- **Sentence**: Split documents at sentence boundaries with size limits
- **Semantic**: Identifies semantic boundaries using sentence embeddings
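For comparison with the semantic example in the next section, here is a minimal sketch of a fixed-size configuration. It reuses the `ChunkingConfig`/`chunk_text` API shown below; the specific `chunk_size` and `overlap` values are illustrative assumptions, not verified defaults.

```python
from src.chunking import ChunkingConfig, chunk_text

# Fixed-size chunking: split into 512-character chunks with a small overlap.
# Parameter values here are illustrative; check your installation for actual defaults.
fixed_config = ChunkingConfig(
    strategy="Fixed",
    parameters={
        "chunk_size": 512,  # Maximum characters per chunk
        "overlap": 50,      # Characters shared between consecutive chunks
    },
)

chunks = chunk_text("Your document text here...", fixed_config)
```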
### Semantic Chunking
The semantic chunking strategy uses sentence transformers to intelligently split documents:
```python
from src.chunking import ChunkingConfig, chunk_text

# Configure semantic chunking
config = ChunkingConfig(
    strategy="Semantic",
    parameters={
        "chunk_size": 768,             # Default for semantic (vs 512 for others)
        "overlap": 0,                  # Optional overlap between chunks
        "window_size": 1,              # Context window for similarity calculation
        "threshold_percentile": 90.0,  # Percentile threshold for splits
        "model_name": "all-MiniLM-L6-v2",  # Sentence transformer model
    },
)

# Chunk your text
chunks = chunk_text("Your document text here...", config)
```
**Key Benefits**:
- Preserves semantic meaning across chunk boundaries
- Automatically finds natural break points in text
- Respects size limits while maintaining context
- Uses a 768-character default (optimal for semantic understanding)
**Note**: Semantic chunking uses sentence-transformers for chunking decisions, but the resulting chunks are embedded using your collection's embedding model (e.g., nomic-embed-text) for search operations.
### Testing Semantic Chunking
You can test the semantic chunking functionality using the CLI:
```bash
# Check collection information to see the chunking strategy
cli/maestro-k collection info --vdb "Qiskit_studio_algo" --name "Qiskit_studio_algo"

# Semantic chunking will:
# - Respect the chunk_size limit while preserving meaning
# - Default to 768 characters (vs 512 for other strategies)
```
**Note**: Semantic chunking uses sentence-transformers for chunking decisions, but the resulting chunks are embedded using your collection's embedding model (e.g., nomic-embed-text) for search operations.
### Environment Variable Substitution in YAML Files
The CLI supports environment variable substitution in YAML files using the `{{ENV_VAR_NAME}}` syntax. This allows you to use environment variables directly in your configuration files:
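For example, a configuration file can pull connection details from the environment at load time. The keys in this snippet are illustrative only, not a required schema; what matters is the `{{ENV_VAR_NAME}}` placeholders, which are replaced with the values of the corresponding environment variables:

```yaml
# config.yaml (illustrative keys, not a required schema)
vector_database:
  name: my_collection
  uri: "{{VDB_URL}}"          # replaced with the value of $VDB_URL
  api_key: "{{VDB_API_KEY}}"  # replaced with the value of $VDB_API_KEY
```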
## License
Apache 2.0 License - see the main project LICENSE file for details.
## Semantic Chunking Example
The CLI supports semantic chunking for intelligent document splitting:
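As a quick illustration, here is a minimal sketch using the Python `ChunkingConfig`/`chunk_text` API shown earlier rather than the CLI; the sample text is made up, and it assumes `chunk_text` returns plain strings:

```python
from src.chunking import ChunkingConfig, chunk_text

# Semantic strategy; only chunk_size is set explicitly, other parameters use defaults
config = ChunkingConfig(strategy="Semantic", parameters={"chunk_size": 768})

text = (
    "Quantum circuits are built from gates. Gates act on qubits. "
    "Transpilation rewrites a circuit for a target backend. "
    "The backend then executes the optimized circuit."
)

# Assumes each returned chunk is a plain string
for i, chunk in enumerate(chunk_text(text, config)):
    print(f"Chunk {i}: {len(chunk)} characters")
```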
**Note**: The semantic chunking strategy uses sentence-transformers for chunking decisions, while the collection's own embedding model is used for search operations.