Error: Token indices sequence length is longer than the specified maximum sequence length for this model #1233
Hello everyone, I've been trying to figure out why I get this error/warning when I run my pipeline. I'm not even sure this is the right place to ask, but here we go. First, a little background on what I want to do: I want to build a RAG pipeline using Milvus as the vector database and Docling as the document converter, with Haystack as the backend. Here's a snippet of the code:

```python
# Import all the libraries (reconstructed here from what the snippet uses)
from pathlib import Path

from docling.chunking import HybridChunker
from docling_haystack.converter import DoclingConverter, ExportType
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore
#############################
DOCUMENTS_DIR = Path("./dummy_data")
FILES = [{"uri": file.resolve()} for file in DOCUMENTS_DIR.iterdir() if file.is_file()]
EXPORT_TYPE = ExportType.DOC_CHUNKS
EMBED_MODEL_ID = "BAAI/bge-m3"  # alternative: "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 2500

# Initialize Milvus DB
document_store = MilvusDocumentStore(
    connection_args={"uri": "./milvus.db"},  # Milvus Lite
    # connection_args={"uri": "http://localhost:19530"},  # Milvus standalone Docker service
    index_params={"index_type": "AUTOINDEX", "metric_type": "L2"},
    drop_old=True,
    text_field="txt",  # set to avoid a conflict with a same-name metadata field
)

# SET UP THE INDEXING PIPELINE
idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(
            tokenizer=EMBED_MODEL_ID,  # instance or model name; defaults to "sentence-transformers/all-MiniLM-L6-v2"
            max_tokens=MAX_TOKENS,
        ),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),  # match the chunker's tokenizer
    # or: OpenAIDocumentEmbedder(model="text-embedding-3-large"),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "embedder")
idx_pipe.connect("embedder", "writer")

# RUN THE INDEXER
print(f"Indexing {len(FILES)} files...")
pipe_out = idx_pipe.run(
    data={"converter": {"paths": list(DOCUMENTS_DIR.glob("**/*"))}},
    include_outputs_from={"converter"},  # expects a set of component names
)
print(f"Total Chunks in Milvus: {document_store.count_documents()}")
```

[Output screenshot: the sequence-length warning (for a sequence of 26838 tokens) appears first, before "Indexing 1 files..." and the chunk count.]
I tried both the OpenAI embedder and the HF embedder and get the same result. My main question is: where does it find a sequence of length 26838? Why is there a sequence of that length when my max_tokens parameter is 2500? Shouldn't all chunks be at most 2500 tokens?

Also, the sequence warning appears in the output before the first print statement ("Indexing 1 files..."), which suggests it comes from somewhere outside that cell. I am very confused and have been looking at this for days. The chunker tokenizer is BAAI/bge-m3 and the embedder is text-embedding-3-large from OpenAI; both have the same maximum input size. I cannot narrow down which step of the code triggers this warning or how to solve it. Should I add Haystack's splitter to the pipeline to split the chunks further? I am under the impression that HybridChunker already handles that, and I have set it up with the max_tokens parameter.

Checking the tokens of each chunk, I get this:

[screenshot of per-chunk token counts]
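For reference, a sketch of how the per-chunk counts can be checked; it assumes the chunks come back as Haystack Documents under `pipe_out["converter"]["documents"]`, via the `include_outputs_from` setting above:

```python
# Sketch: verify per-chunk token counts with the same tokenizer the chunker
# uses. Assumes include_outputs_from={"converter"} exposes the chunks under
# pipe_out["converter"]["documents"].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
for doc in pipe_out["converter"]["documents"]:
    n_tokens = len(tokenizer.tokenize(doc.content))
    print(f"{n_tokens:>6} tokens | {doc.content[:60]!r}")
```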
Sorry for the long text, and thank you for reading!

EDIT: If I run the last cell again without restarting the notebook, I do not get the sequence-length warning. But if I restart the notebook and run the cell, the warning is back. So it pops up only once per notebook session; running the cell again and again does not make it reappear. Further down I have a RAG pipeline that works, so I do not know whether to ignore the warning or keep looking for a solution.
Replies: 1 comment · 2 replies
You can ignore the error; this exact warning is covered on the Docling FAQ page: https://docling-project.github.io/docling/faq/. In short, per that FAQ entry: the HybridChunker internally tokenizes passages that can be longer than the model's maximum while determining chunk boundaries, which triggers the transformers warning, but the chunks it ultimately produces still respect max_tokens, so the warning is a false alarm here.
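The once-per-session behavior you describe fits this explanation: as far as I can tell from the transformers source, this particular warning is emitted at most once per tokenizer instance. A minimal sketch to reproduce it outside the pipeline (assumes network access to fetch the tokenizer):

```python
# Reproduce the warning with the tokenizer alone, outside Haystack/Docling.
# transformers emits this sequence-length warning at most once per tokenizer
# instance, which matches the "only once per notebook session" observation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
long_text = "word " * 30_000   # far longer than the model's maximum sequence length
tok(long_text)                 # first call: emits the warning
tok(long_text)                 # second call: silent
```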
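If you'd rather not see it at all, one option is to raise the transformers log threshold before indexing. Note this is a sketch and it silences other transformers warnings too, not just this one:

```python
# Optional: suppress the (benign) warning by raising the transformers log
# threshold to ERROR. Caveat: this hides all transformers warnings.
from transformers.utils import logging as hf_logging

hf_logging.set_verbosity_error()
```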