Description
Elasticsearch Version
v9.1.5 (previously working fine in v8.18)
Installed Plugins
No response
Java Version
bundled
OS Version
Darwin <my_macbook>.local 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041 arm64
Problem Description
Hi all 🙌 — Chunking question here.
I was using v8.18, and everything worked fine.
After upgrading to v9.1.5, my code started failing when running bulk ingestion via helpers.bulk(es, actions) from the Python client.
Error:
{
"type": "inference_exception",
"reason": "Exception when running inference id [google_embedding_004_inference] on field [content]",
"caused_by": {
"type": "status_exception",
"reason": "Received an unsuccessful status code for request from inference entity id [google_embedding_004_inference] status [400]. Error message: [Unable to submit request because the input token count is 75239 but the model supports up to 20000. Reduce the input token count and try again. You can also use the CountTokens API to calculate prompt token count and billable characters. Learn more: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models]"
}
}

Before upgrading, this exact setup worked without issue.
It now appears that the chunking configuration in the inference setup might not be respected anymore.
Could you help confirm whether there’s been a change in chunking behavior or inference request handling between 8.18 and 9.1.5? 🙏
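For reference, one way to check whether the chunking settings are still attached to the endpoint after the upgrade is to fetch its definition. This is just a sketch assuming the Python client's inference namespace exposes a get call; the underlying REST call is GET _inference/google_embedding_004_inference, and the connection details are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Fetch the endpoint definition; chunking_settings should still show
# "strategy": "word", "max_chunk_size": 200, "overlap": 50 if it survived the upgrade.
resp = es.inference.get(inference_id="google_embedding_004_inference")
print(resp["endpoints"][0].get("chunking_settings"))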
Steps to Reproduce
Use the following inference endpoint configuration. This is my setup, but basically:
- Create a semantic index (a consolidated Python sketch of these steps follows the example action item below).
- Use a Google Vertex AI endpoint with the text-embedding-004 model.
- Chunking settings: 200 words with an overlap of 50.
- Try to index a long text; on my side it fails with the error above.
{
"endpoints": [
{
"inference_id": "google_embedding_004_inference",
"task_type": "text_embedding",
"service": "googlevertexai",
"service_settings": {
"location": "<HIDDEN>",
"project_id": "<HIDDEN>",
"model_id": "text-embedding-004",
"dimensions": 768,
"similarity": "cosine",
"rate_limit": {
"requests_per_minute": 30000
}
},
"chunking_settings": {
"strategy": "word",
"max_chunk_size": 200,
"overlap": 50
}
}
]
}
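In case it helps reproduce, an endpoint like the one above can be created roughly as follows. This is only a sketch: it assumes the Python client's inference.put helper accepts the endpoint body as inference_config (the REST equivalent is PUT _inference/text_embedding/google_embedding_004_inference with the same body), and the credential and project values are placeholders. The service_account_json field is an assumption about where the Vertex AI credentials go, since secrets are not returned in the GET output above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Create a Google Vertex AI text_embedding endpoint with the word chunking strategy.
es.inference.put(
    task_type="text_embedding",
    inference_id="google_embedding_004_inference",
    inference_config={
        "service": "googlevertexai",
        "service_settings": {
            "location": "<HIDDEN>",
            "project_id": "<HIDDEN>",
            "model_id": "text-embedding-004",
            "service_account_json": "<HIDDEN>",  # assumption: Vertex AI credentials
        },
        "chunking_settings": {
            "strategy": "word",
            "max_chunk_size": 200,
            "overlap": 50,
        },
    },
)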
Run bulk indexing with:

from elasticsearch import helpers
helpers.bulk(es, actions)

Example action item:
{
"_op_type": "index",
"_index": "v1-opinions-semantic",
"_id": 11173100,
"_source": {
"doc_id": 11173100,
"case_name": "Jesse James Clay v. the State of Texas",
"content": "<LONG OPINION TEXT HERE>"
}
}

Observe the inference error returned from the embedding model.
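Here is a consolidated sketch of the reproduction with the Python client. The index mapping is an assumption (it is not shown above); the relevant part is that content is a semantic_text field bound to the inference endpoint, and the connection details are placeholders:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Assumed mapping: "content" is a semantic_text field wired to the endpoint above,
# so Elasticsearch runs (and is expected to chunk) the inference at index time.
es.indices.create(
    index="v1-opinions-semantic",
    mappings={
        "properties": {
            "doc_id": {"type": "long"},
            "case_name": {"type": "text"},
            "content": {
                "type": "semantic_text",
                "inference_id": "google_embedding_004_inference",
            },
        }
    },
)

# A single long document is enough to trigger the 400 from Vertex AI.
actions = [
    {
        "_op_type": "index",
        "_index": "v1-opinions-semantic",
        "_id": 11173100,
        "_source": {
            "doc_id": 11173100,
            "case_name": "Jesse James Clay v. the State of Texas",
            "content": "<LONG OPINION TEXT HERE>",
        },
    }
]

helpers.bulk(es, actions)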
Expected Behavior
The chunking configuration should automatically split the document text to respect the model’s maximum token limit (as in v8.18).
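To make the expectation concrete, this is roughly what a word strategy with max_chunk_size 200 and overlap 50 should produce. It is only an illustration of the expected splitting, not the actual Elasticsearch chunker; each resulting chunk stays far below the model's token limit:

def word_chunks(text, max_chunk_size=200, overlap=50):
    # Split into overlapping windows of words: each chunk holds at most
    # max_chunk_size words and repeats the last `overlap` words of the
    # previous chunk, mirroring the configured chunking settings.
    words = text.split()
    step = max_chunk_size - overlap
    return [
        " ".join(words[i:i + max_chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

# A ~75k-token opinion would come back as a few hundred small chunks,
# each of which should be embedded by Vertex AI as a separate input.
chunks = word_chunks("<LONG OPINION TEXT HERE>")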
Actual Behavior
Despite having max_chunk_size: 200 in the inference configuration, the request to Vertex AI receives the entire document content, leading to a 400 status due to excessive token count.
Logs (if relevant)
No response