Replies: 2 comments
-
Yes, you can use OpenAI embeddings with HybridChunker. HybridChunker supports pluggable tokenizers, including OpenAITokenizer, which uses OpenAI's tiktoken library for tokenization. To enable this, install the extra dependencies:

```
pip install 'docling-core[chunking-openai]'
```

Then configure HybridChunker to use OpenAITokenizer like this:

```python
import tiktoken

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o"),
    max_tokens=128 * 1024,  # adjust as needed for your model/context window
)
chunker = HybridChunker(tokenizer=tokenizer)

# your_doc: a DoclingDocument, e.g. from a prior DocumentConverter run
chunks = list(chunker.chunk(dl_doc=your_doc))
```

HybridChunker will produce chunks aligned with OpenAI tokenization, ready for embedding with the OpenAI Embeddings API. For more details and examples, see the documentation and concepts guide.
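To make that last step concrete, here's a minimal sketch of embedding the resulting chunks with the OpenAI Embeddings API, continuing from the snippet above. The `openai` SDK usage and the `text-embedding-3-small` model are example choices, not something docling prescribes:

```python
# Sketch only: embed HybridChunker output via the OpenAI Embeddings API.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# contextualize() enriches each chunk's text with heading/caption metadata,
# which is usually what you want to embed for retrieval.
texts = [chunker.contextualize(chunk=chunk) for chunk in chunks]

response = client.embeddings.create(
    model="text-embedding-3-small",  # example embedding model
    input=texts,
)
embeddings = [item.embedding for item in response.data]
```

For large documents you may need to batch the `input` list, since the Embeddings API limits how many inputs a single request can carry.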
-
Yes, HybridChunker works smoothly with OpenAI embeddings, but for anyone hitting edge cases in chunk alignment, it's worth noting that OpenAI's tokenizer can draw token boundaries in places you might not expect. If your downstream performance (in RAG or search) starts to wobble, it might not be your retrieval logic; it could be the token boundary logic itself. Been working on a set of semantic chunk correction layers to address that. Happy to share ideas if anyone's interested.
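One hypothetical sanity check for such cases (reusing only the objects from the first reply above) is to re-tokenize each contextualized chunk with the same tiktoken encoding the chunker used and flag anything that overshoots the token budget:

```python
# Sketch of a token-budget check; the encoding and limit must mirror the
# chunker's own configuration for the comparison to be meaningful.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # same encoding as the chunker
MAX_TOKENS = 128 * 1024  # same limit passed to OpenAITokenizer

for i, chunk in enumerate(chunks):
    n_tokens = len(enc.encode(chunker.contextualize(chunk=chunk)))
    if n_tokens > MAX_TOKENS:
        print(f"chunk {i} exceeds the budget: {n_tokens} tokens")
```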
-
Is it possible to use OpenAI embeddings for HybridChunker?