Open
Description
The documentation explains how to configure and use the OpenAI tokenizer with the hybrid chunker in Python:
import tiktoken
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o"),
    max_tokens=128 * 1024,  # context window length required for OpenAI tokenizers
)
chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # optional, defaults to True
)
# `doc` is a DoclingDocument produced earlier (e.g. by a DocumentConverter)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
How can the same operation be done with docling-java?
HybridChunkerOptions.builder().tokenizer() seems to support only HuggingFace models.