
Hybrid chunking using OpenAI Tokenizer #260

@mohammedfaisal

Description


The documentation explains how to configure and use an OpenAI tokenizer with the hybrid chunker in Python:

    import tiktoken

    from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
    from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

    tokenizer = OpenAITokenizer(
        tokenizer=tiktoken.encoding_for_model("gpt-4o"),
        max_tokens=128 * 1024,  # context window length required for OpenAI tokenizers
    )

    chunker = HybridChunker(
        tokenizer=tokenizer,
        merge_peers=True,  # optional, defaults to True
    )
    chunk_iter = chunker.chunk(dl_doc=doc)
    chunks = list(chunk_iter)
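As a quick sanity check (not part of the docling API, just plain tiktoken), the encoding resolved by `encoding_for_model("gpt-4o")` can be used directly to see how many tokens a piece of text consumes, which is the count the chunker compares against `max_tokens`:

    import tiktoken

    # encoding_for_model resolves the encoding used by the given model;
    # for gpt-4o this is the o200k_base encoding.
    enc = tiktoken.encoding_for_model("gpt-4o")

    text = "Docling converts documents into a unified representation."
    token_ids = enc.encode(text)

    # The length of the token ID list is the token count used
    # when deciding whether a chunk fits within max_tokens.
    print(len(token_ids))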

How can I do the same operation using docling-java?
`HybridChunkerOptions.builder().tokenizer()` seems to support only HuggingFace models.
