Replies: 2 comments
-
Yes, you can use OpenAI embeddings with HybridChunker. HybridChunker supports pluggable tokenizers, including OpenAITokenizer, which uses OpenAI's tiktoken library for tokenization. To enable this, install the extra dependencies:

```
pip install 'docling-core[chunking-openai]'
```

Then configure HybridChunker to use OpenAITokenizer like this:

```python
import tiktoken

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o"),
    max_tokens=128 * 1024,  # adjust as needed for your model/context window
)
chunker = HybridChunker(tokenizer=tokenizer)

# your_doc: a DoclingDocument, e.g. from a prior DocumentConverter run
chunks = list(chunker.chunk(dl_doc=your_doc))
```

HybridChunker will produce chunks aligned with OpenAI tokenization, ready for embedding with the OpenAI Embeddings API. For more details and examples, see the documentation and concepts guide.
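To make that last step concrete, here's a minimal sketch of embedding the resulting chunks with the OpenAI Embeddings API, continuing from the snippet above. The `openai` SDK usage and the `text-embedding-3-small` model are example choices, not something docling prescribes:

```python
# Sketch only: embed HybridChunker output via the OpenAI Embeddings API.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# contextualize() enriches each chunk's text with heading/caption metadata,
# which is usually what you want to embed for retrieval.
texts = [chunker.contextualize(chunk=chunk) for chunk in chunks]

response = client.embeddings.create(
    model="text-embedding-3-small",  # example embedding model
    input=texts,
)
embeddings = [item.embedding for item in response.data]
```

For large documents you may need to batch the `input` list, since the Embeddings API limits how many inputs a single request can carry.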
-
Yes, HybridChunker works smoothly with OpenAI embeddings, but for anyone hitting edge cases in chunk alignment, it's worth noting that OpenAI's tokenizer can draw token boundaries in places you might not expect. If your downstream performance (in RAG or search) starts to wobble, it might not be your retrieval logic; it could be the token boundary logic itself. Been working on a set of semantic chunk correction layers to address that. Happy to share ideas if anyone's interested.
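One hypothetical sanity check for such cases (reusing only the objects from the first reply above) is to re-tokenize each contextualized chunk with the same tiktoken encoding the chunker used and flag anything that overshoots the token budget:

```python
# Sketch of a token-budget check; the encoding and limit must mirror the
# chunker's own configuration for the comparison to be meaningful.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # same encoding as the chunker
MAX_TOKENS = 128 * 1024  # same limit passed to OpenAITokenizer

for i, chunk in enumerate(chunks):
    n_tokens = len(enc.encode(chunker.contextualize(chunk=chunk)))
    if n_tokens > MAX_TOKENS:
        print(f"chunk {i} exceeds the budget: {n_tokens} tokens")
```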
-
Is it possible to use OpenAI embeddings for HybridChunker?