Commit 2591c70

feat: add default tokenizer to HybridChunker (#107)
Signed-off-by: Panos Vagenas <[email protected]>
1 parent a66b0bb commit 2591c70

3 files changed, 477 additions and 1 deletion

docling_core/transforms/chunker/hybrid_chunker.py

Lines changed: 3 additions & 1 deletion
@@ -44,7 +44,9 @@ class HybridChunker(BaseChunker):

     model_config = ConfigDict(arbitrary_types_allowed=True)

-    tokenizer: Union[PreTrainedTokenizerBase, str]
+    tokenizer: Union[PreTrainedTokenizerBase, str] = (
+        "sentence-transformers/all-MiniLM-L6-v2"
+    )
     max_tokens: int = None  # type: ignore[assignment]
     merge_peers: bool = True
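
The effect of this hunk is that `tokenizer` no longer has to be passed explicitly: when omitted, the field holds the model name string `"sentence-transformers/all-MiniLM-L6-v2"`. The sketch below illustrates that pattern with the standard library only. It is not the `HybridChunker` implementation: `SketchChunker`, `load_tokenizer`, and the whitespace tokenizer are hypothetical stand-ins, and how the real class turns a model name into a tokenizer object is not shown in this diff.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Hypothetical stand-in for a tokenizer object. The real field accepts a
# PreTrainedTokenizerBase instance or a model name string.
Tokenizer = Callable[[str], List[str]]

# Default model name taken from the diff above.
DEFAULT_TOKENIZER_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def load_tokenizer(name: str) -> Tokenizer:
    """Hypothetical loader: resolve a model name to a tokenizer.

    A whitespace splitter is returned here so the sketch runs without
    any third-party dependency or model download.
    """
    return str.split


@dataclass
class SketchChunker:
    # Before the change this field had no default and was therefore required.
    tokenizer: Union[Tokenizer, str] = DEFAULT_TOKENIZER_NAME
    merge_peers: bool = True

    def resolved_tokenizer(self) -> Tokenizer:
        # A string is treated as a model name; anything else is used as-is.
        if isinstance(self.tokenizer, str):
            return load_tokenizer(self.tokenizer)
        return self.tokenizer


# No tokenizer argument: the default model name is used.
default_chunker = SketchChunker()

# An explicit tokenizer still overrides the default.
custom_chunker = SketchChunker(tokenizer=lambda text: list(text))
```

With the actual class, the equivalent of `SketchChunker()` would presumably be constructing `HybridChunker` without a `tokenizer` argument, while callers that already pass one are unaffected.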
