Fix OpenAITokenizerWrapper compatibility with Docling HybridChunker #23

Open

deril2605 wants to merge 1 commit into daveebbelaar:main from deril2605:patch-1
Conversation

@deril2605

Docling’s HybridChunker evaluates the provided tokenizer in a boolean
context (e.g. `tokenizer or default_tokenizer`). Since
OpenAITokenizerWrapper subclasses HuggingFace’s PreTrainedTokenizerBase,
this triggered the base class __len__() method, which raises
NotImplementedError by default, causing chunking to fail at initialization.

This change adds explicit __len__() and __bool__() implementations to make
the tokenizer safe for truthiness checks and compatible with Docling’s
internal validation logic.

Additionally, this fixes the vocabulary size calculation to align with
HuggingFace semantics (vocab_size = max_token_id + 1) and corrects the
get_vocab() mapping to return token → id pairs as expected.

No tokenization behavior or chunking logic is changed; this is a
compatibility and robustness fix only. The tokenizer continues to use
tiktoken for local token counting and remains model-agnostic, suitable
for OpenAI and Azure OpenAI workflows.
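
The failure mode and fix described above can be sketched in isolation. This is a minimal illustration, not the PR's actual diff: the stub base class below stands in for HuggingFace's `PreTrainedTokenizerBase` (whose default `__len__` raises `NotImplementedError`), and `pick_tokenizer` stands in for Docling's `tokenizer or default_tokenizer` boolean check.

```python
class PreTrainedTokenizerBaseStub:
    """Hypothetical stand-in for HuggingFace's PreTrainedTokenizerBase:
    its default __len__ raises unless a subclass overrides it."""

    def __len__(self):
        raise NotImplementedError


class BrokenWrapper(PreTrainedTokenizerBaseStub):
    # No __len__/__bool__: a truthiness check falls through to the
    # base class __len__ and raises NotImplementedError.
    pass


class FixedWrapper(PreTrainedTokenizerBaseStub):
    def __init__(self, vocab):
        # token -> id mapping, the shape get_vocab() is expected to return
        self._vocab = dict(vocab)

    def get_vocab(self):
        return dict(self._vocab)

    @property
    def vocab_size(self):
        # HuggingFace semantics: vocab_size = max_token_id + 1
        return max(self._vocab.values()) + 1

    def __len__(self):
        return self.vocab_size

    def __bool__(self):
        # Safe for `tokenizer or default_tokenizer` style checks
        return True


def pick_tokenizer(tokenizer, default):
    # The boolean-context evaluation performed inside HybridChunker
    return tokenizer or default


default = FixedWrapper({"a": 0})

try:
    pick_tokenizer(BrokenWrapper(), default)
    broken_raised = False
except NotImplementedError:
    broken_raised = True  # base-class __len__ fired, as described above

fixed = FixedWrapper({"hello": 0, "world": 4})
chosen = pick_tokenizer(fixed, default)
```

With the explicit `__bool__`, the truthiness check never touches `__len__`, and `vocab_size` follows the max-id-plus-one convention even when token ids are sparse.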
