Error: Token indices sequence length is longer than the specified maximum sequence length for this model #1233
Hello everyone, I've been trying to figure out why I get this error/warning when I run my pipeline. I'm not even sure this is the right place to ask, but here we go. First, a little background on what I want to do: I want to build a RAG pipeline using Milvus as the vector database and Docling as the document converter, with Haystack as the backend. Here's a snippet of the code:

```python
# Import all the libraries (reconstructed here from what the snippet uses)
from pathlib import Path

from docling.chunking import HybridChunker
from docling_haystack.converter import DoclingConverter, ExportType
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore
#############################
DOCUMENTS_DIR = Path("./dummy_data")
FILES = [{"uri": file.resolve()} for file in DOCUMENTS_DIR.iterdir() if file.is_file()]
EXPORT_TYPE = ExportType.DOC_CHUNKS
EMBED_MODEL_ID = "BAAI/bge-m3"  # alternative: "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 2500

# Initialize Milvus DB
document_store = MilvusDocumentStore(
    connection_args={"uri": "./milvus.db"},  # Milvus Lite
    # connection_args={"uri": "http://localhost:19530"},  # Milvus standalone Docker service
    index_params={"index_type": "AUTOINDEX", "metric_type": "L2"},
    drop_old=True,
    text_field="txt",  # set to avoid a conflict with a same-name metadata field
)

# SET UP THE INDEXING PIPELINE
idx_pipe = Pipeline()
idx_pipe.add_component(
    "converter",
    DoclingConverter(
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(
            tokenizer=EMBED_MODEL_ID,  # instance or model name; defaults to "sentence-transformers/all-MiniLM-L6-v2"
            max_tokens=MAX_TOKENS,
        ),
    ),
)
idx_pipe.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),  # match the chunker's tokenizer
    # or: OpenAIDocumentEmbedder(model="text-embedding-3-large"),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "embedder")
idx_pipe.connect("embedder", "writer")

# RUN THE INDEXER
print(f"Indexing {len(FILES)} files...")
pipe_out = idx_pipe.run(
    data={"converter": {"paths": list(DOCUMENTS_DIR.glob("**/*"))}},
    include_outputs_from={"converter"},  # expects a set of component names
)
print(f"Total Chunks in Milvus: {document_store.count_documents()}")
```

[Output screenshot: the sequence-length warning (for a sequence of 26838 tokens) appears first, before "Indexing 1 files..." and the chunk count.]
I tried both the OpenAI embedder and the HF embedder and get the same result. My main question is: where does it find a sequence of length 26838? Why is there a sequence of that length when my max_tokens parameter is 2500? Shouldn't all chunks be at most 2500 tokens?

Also, the sequence warning appears in the output before the first print statement ("Indexing 1 files..."), which suggests it comes from somewhere outside that cell. I am very confused and have been looking at this for days. The chunker tokenizer is BAAI/bge-m3 and the embedder is text-embedding-3-large from OpenAI; both have the same maximum input size. I cannot narrow down which step of the code triggers this warning or how to solve it. Should I add Haystack's splitter to the pipeline to split the chunks further? I am under the impression that HybridChunker already handles that, and I have set it up with the max_tokens parameter.

Checking the tokens of each chunk, I get this:

[screenshot of per-chunk token counts]
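For reference, a sketch of how the per-chunk counts can be checked; it assumes the chunks come back as Haystack Documents under `pipe_out["converter"]["documents"]`, via the `include_outputs_from` setting above:

```python
# Sketch: verify per-chunk token counts with the same tokenizer the chunker
# uses. Assumes include_outputs_from={"converter"} exposes the chunks under
# pipe_out["converter"]["documents"].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
for doc in pipe_out["converter"]["documents"]:
    n_tokens = len(tokenizer.tokenize(doc.content))
    print(f"{n_tokens:>6} tokens | {doc.content[:60]!r}")
```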
Sorry for the long text, and thank you for reading!

EDIT: If I run the last cell again without restarting the notebook, I do not get the sequence-length warning. But if I restart the notebook and run the cell, the warning is back. So it pops up only once per notebook session; running the cell again and again does not make it reappear. Further down I have a RAG pipeline that works, so I do not know whether to ignore the warning or keep looking for a solution.
Replies: 1 comment · 2 replies
You can ignore the error; this exact warning is covered on the Docling FAQ page: https://docling-project.github.io/docling/faq/. In short, per that FAQ entry: the HybridChunker internally tokenizes passages that can be longer than the model's maximum while determining chunk boundaries, which triggers the transformers warning, but the chunks it ultimately produces still respect max_tokens, so the warning is a false alarm here.
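The once-per-session behavior you describe fits this explanation: as far as I can tell from the transformers source, this particular warning is emitted at most once per tokenizer instance. A minimal sketch to reproduce it outside the pipeline (assumes network access to fetch the tokenizer):

```python
# Reproduce the warning with the tokenizer alone, outside Haystack/Docling.
# transformers emits this sequence-length warning at most once per tokenizer
# instance, which matches the "only once per notebook session" observation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
long_text = "word " * 30_000   # far longer than the model's maximum sequence length
tok(long_text)                 # first call: emits the warning
tok(long_text)                 # second call: silent
```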
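If you'd rather not see it at all, one option is to raise the transformers log threshold before indexing. Note this is a sketch and it silences other transformers warnings too, not just this one:

```python
# Optional: suppress the (benign) warning by raising the transformers log
# threshold to ERROR. Caveat: this hides all transformers warnings.
from transformers.utils import logging as hf_logging

hf_logging.set_verbosity_error()
```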