
Regression after upgrade to v9.1.5: Inference fails due to input token limit despite chunking configuration #137194

@yfenes

Description

Elasticsearch Version

v9.1.5 (previously working fine in v8.18)

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin <my_macbook>.local 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041 arm64

Problem Description

Hi all 🙌 — Chunking question here.

I was using v8.18, and everything worked fine.
After upgrading to v9.1.5, my code started failing when running bulk ingestion via helpers.bulk(es, actions) from the Python client.

Error:

{
    "type": "inference_exception",
    "reason": "Exception when running inference id [google_embedding_004_inference] on field [content]",
    "caused_by": {
        "type": "status_exception",
        "reason": "Received an unsuccessful status code for request from inference entity id [google_embedding_004_inference] status [400]. Error message: [Unable to submit request because the input token count is 75239 but the model supports up to 20000. Reduce the input token count and try again. You can also use the CountTokens API to calculate prompt token count and billable characters. Learn more: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models]"
    }
}

Before upgrading, this exact setup worked without issue.
It now appears that the chunking configuration in the inference setup might not be respected anymore.

Could you help confirm whether there’s been a change in chunking behavior or inference request handling between 8.18 and 9.1.5? 🙏

Steps to Reproduce

The setup, in short:

  • Create a semantic index (a sketch of the index mapping is included after the endpoint configuration below).
  • Use a Google Vertex AI endpoint with model text-embedding-004.
  • Configure chunking: word strategy, max_chunk_size 200 with overlap 50.
  • Index a long text; on my side this fails with the error shown above.

Inference endpoint configuration (project details redacted):
{
  "endpoints": [
    {
      "inference_id": "google_embedding_004_inference",
      "task_type": "text_embedding",
      "service": "googlevertexai",
      "service_settings": {
        "location": "<HIDDEN>",
        "project_id": "<HIDDEN>",
        "model_id": "text-embedding-004",
        "dimensions": 768,
        "similarity": "cosine",
        "rate_limit": {
          "requests_per_minute": 30000
        }
      },
      "chunking_settings": {
        "strategy": "word",
        "max_chunk_size": 200,
        "overlap": 50
      }
    }
  ]
}
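
For reference, the semantic index is created roughly like this (a simplified sketch: the connection is a placeholder and the mapping is trimmed to the fields shown in the action item below):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Minimal mapping: the "content" field is a semantic_text field that delegates
# chunking and embedding to the inference endpoint configured above.
es.indices.create(
    index="v1-opinions-semantic",
    mappings={
        "properties": {
            "doc_id": {"type": "long"},
            "case_name": {"type": "text"},
            "content": {
                "type": "semantic_text",
                "inference_id": "google_embedding_004_inference",
            },
        }
    },
)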

Run bulk indexing with:

from elasticsearch import helpers

helpers.bulk(es, actions)

Example action item:

{
  "_op_type": "index",
  "_index": "v1-opinions-semantic",
  "_id": 11173100,
  "_source": {
    "doc_id": 11173100,
    "case_name": "Jesse James Clay v. the State of Texas",
    "content": "<LONG OPINION TEXT HERE>"
  }
}
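
Putting it together, the failing bulk call looks roughly like this (the client setup and the opinion text are placeholders for the real values):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection

long_opinion_text = "..."  # stands in for the full opinion text, tens of thousands of tokens

actions = [
    {
        "_op_type": "index",
        "_index": "v1-opinions-semantic",
        "_id": 11173100,
        "_source": {
            "doc_id": 11173100,
            "case_name": "Jesse James Clay v. the State of Texas",
            "content": long_opinion_text,
        },
    }
]

helpers.bulk(es, actions)  # fails with the inference_exception shown above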

Observe the inference error returned from the embedding model.


Expected Behavior

The chunking configuration should automatically split the document text so that each inference request stays within the model's maximum token limit, as it did in v8.18.
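
For context, with max_chunk_size 200 and overlap 50 each chunk advances by 150 words, so even a very long opinion should reach Vertex AI as many small requests rather than one huge one. A rough illustration of that expectation (this is not Elasticsearch's actual chunker, just the arithmetic):

def word_chunks(text, max_chunk_size=200, overlap=50):
    # Rough illustration of word-based chunking with overlap; not the ES implementation.
    words = text.split()
    stride = max_chunk_size - overlap  # chunks advance 150 words at a time
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + max_chunk_size]))
        if start + max_chunk_size >= len(words):
            break
    return chunks

# A ~50,000-word opinion would yield ~333 chunks of at most 200 words each,
# every one of them far below the model's 20,000-token limit.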

Actual Behavior

Despite max_chunk_size: 200 in the inference configuration, the request sent to Vertex AI appears to contain the entire document content, resulting in a 400 response because the input token count (75,239 in this case) exceeds the model's 20,000-token limit.
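
The chunking_settings are present in the endpoint configuration shown above, so the settings appear to be stored but no longer applied at ingest. A quick way to re-check from the client, assuming the inference.get helper available in recent elasticsearch-py versions:

# Confirm the endpoint still reports the chunking settings after the upgrade.
resp = es.inference.get(inference_id="google_embedding_004_inference")
print(resp["endpoints"][0]["chunking_settings"])
# Expected: {'strategy': 'word', 'max_chunk_size': 200, 'overlap': 50}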

Logs (if relevant)

No response
