Fix OpenAITokenizerWrapper compatibility with Docling HybridChunker #23

Open

deril2605 wants to merge 1 commit into daveebbelaar:main from deril2605:patch-1
Conversation

@deril2605

Docling’s HybridChunker evaluates the provided tokenizer in a boolean
context (e.g. `tokenizer or default_tokenizer`). Since
OpenAITokenizerWrapper subclasses HuggingFace’s PreTrainedTokenizerBase,
this triggered the base class __len__() method, which raises
NotImplementedError by default, causing chunking to fail at initialization.

This change adds explicit __len__() and __bool__() implementations to make
the tokenizer safe for truthiness checks and compatible with Docling’s
internal validation logic.

Additionally, this fixes the vocabulary size calculation to align with
HuggingFace semantics (vocab_size = max_token_id + 1) and corrects the
get_vocab() mapping to return token → id pairs as expected.

No tokenization behavior or chunking logic is changed; this is a
compatibility and robustness fix only. The tokenizer continues to use
tiktoken for local token counting and remains model-agnostic, suitable
for OpenAI and Azure OpenAI workflows.
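
The failure mode and fix described above can be sketched in isolation. This is a minimal illustration, not the PR's actual diff: the stub base class below stands in for HuggingFace's `PreTrainedTokenizerBase` (whose default `__len__` raises `NotImplementedError`), and `pick_tokenizer` stands in for Docling's `tokenizer or default_tokenizer` boolean check.

```python
class PreTrainedTokenizerBaseStub:
    """Hypothetical stand-in for HuggingFace's PreTrainedTokenizerBase:
    its default __len__ raises unless a subclass overrides it."""

    def __len__(self):
        raise NotImplementedError


class BrokenWrapper(PreTrainedTokenizerBaseStub):
    # No __len__/__bool__: a truthiness check falls through to the
    # base class __len__ and raises NotImplementedError.
    pass


class FixedWrapper(PreTrainedTokenizerBaseStub):
    def __init__(self, vocab):
        # token -> id mapping, the shape get_vocab() is expected to return
        self._vocab = dict(vocab)

    def get_vocab(self):
        return dict(self._vocab)

    @property
    def vocab_size(self):
        # HuggingFace semantics: vocab_size = max_token_id + 1
        return max(self._vocab.values()) + 1

    def __len__(self):
        return self.vocab_size

    def __bool__(self):
        # Safe for `tokenizer or default_tokenizer` style checks
        return True


def pick_tokenizer(tokenizer, default):
    # The boolean-context evaluation performed inside HybridChunker
    return tokenizer or default


default = FixedWrapper({"a": 0})

try:
    pick_tokenizer(BrokenWrapper(), default)
    broken_raised = False
except NotImplementedError:
    broken_raised = True  # base-class __len__ fired, as described above

fixed = FixedWrapper({"hello": 0, "world": 4})
chosen = pick_tokenizer(fixed, default)
```

With the explicit `__bool__`, the truthiness check never touches `__len__`, and `vocab_size` follows the max-id-plus-one convention even when token ids are sparse.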
