
Regression after upgrade to v9.1.5: Inference fails due to input token limit despite chunking configuration #137194

@yfenes

Description

Elasticsearch Version

v9.1.5 (previously working fine in v8.18)

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin <my_macbook>.local 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041 arm64

Problem Description

Hi all 🙌 — Chunking question here.

I was using v8.18, and everything worked fine.
After upgrading to v9.1.5, my code started failing when running bulk ingestion via helpers.bulk(es, actions) from the Python client.

Error:

{
    "type": "inference_exception",
    "reason": "Exception when running inference id [google_embedding_004_inference] on field [content]",
    "caused_by": {
        "type": "status_exception",
        "reason": "Received an unsuccessful status code for request from inference entity id [google_embedding_004_inference] status [400]. Error message: [Unable to submit request because the input token count is 75239 but the model supports up to 20000. Reduce the input token count and try again. You can also use the CountTokens API to calculate prompt token count and billable characters. Learn more: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models]"
    }
}

Before upgrading, this exact setup worked without issue.
It now appears that the chunking configuration in the inference setup might not be respected anymore.

Could you help confirm whether there’s been a change in chunking behavior or inference request handling between 8.18 and 9.1.5? 🙏

Steps to Reproduce

The setup, in short:

  • Create a semantic index (a sketch of the index mapping is included after the endpoint configuration below).
  • Use a Google Vertex AI endpoint with model text-embedding-004.
  • Configure chunking: word strategy, max_chunk_size 200 with overlap 50.
  • Index a long text; on my side this fails with the error shown above.

Inference endpoint configuration (project details redacted):
{
  "endpoints": [
    {
      "inference_id": "google_embedding_004_inference",
      "task_type": "text_embedding",
      "service": "googlevertexai",
      "service_settings": {
        "location": "<HIDDEN>",
        "project_id": "<HIDDEN>",
        "model_id": "text-embedding-004",
        "dimensions": 768,
        "similarity": "cosine",
        "rate_limit": {
          "requests_per_minute": 30000
        }
      },
      "chunking_settings": {
        "strategy": "word",
        "max_chunk_size": 200,
        "overlap": 50
      }
    }
  ]
}
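
For reference, the semantic index is created roughly like this (a simplified sketch: the connection is a placeholder and the mapping is trimmed to the fields shown in the action item below):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Minimal mapping: the "content" field is a semantic_text field that delegates
# chunking and embedding to the inference endpoint configured above.
es.indices.create(
    index="v1-opinions-semantic",
    mappings={
        "properties": {
            "doc_id": {"type": "long"},
            "case_name": {"type": "text"},
            "content": {
                "type": "semantic_text",
                "inference_id": "google_embedding_004_inference",
            },
        }
    },
)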

Run bulk indexing with:

from elasticsearch import helpers

helpers.bulk(es, actions)

Example action item:

{
  "_op_type": "index",
  "_index": "v1-opinions-semantic",
  "_id": 11173100,
  "_source": {
    "doc_id": 11173100,
    "case_name": "Jesse James Clay v. the State of Texas",
    "content": "<LONG OPINION TEXT HERE>"
  }
}
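
Putting it together, the failing bulk call looks roughly like this (the client setup and the opinion text are placeholders for the real values):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection

long_opinion_text = "..."  # stands in for the full opinion text, tens of thousands of tokens

actions = [
    {
        "_op_type": "index",
        "_index": "v1-opinions-semantic",
        "_id": 11173100,
        "_source": {
            "doc_id": 11173100,
            "case_name": "Jesse James Clay v. the State of Texas",
            "content": long_opinion_text,
        },
    }
]

helpers.bulk(es, actions)  # fails with the inference_exception shown above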

Observe the inference error returned from the embedding model.


Expected Behavior

The chunking configuration should automatically split the document text so that each inference request stays within the model's maximum token limit, as it did in v8.18.
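
For context, with max_chunk_size 200 and overlap 50 each chunk advances by 150 words, so even a very long opinion should reach Vertex AI as many small requests rather than one huge one. A rough illustration of that expectation (this is not Elasticsearch's actual chunker, just the arithmetic):

def word_chunks(text, max_chunk_size=200, overlap=50):
    # Rough illustration of word-based chunking with overlap; not the ES implementation.
    words = text.split()
    stride = max_chunk_size - overlap  # chunks advance 150 words at a time
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + max_chunk_size]))
        if start + max_chunk_size >= len(words):
            break
    return chunks

# A ~50,000-word opinion would yield ~333 chunks of at most 200 words each,
# every one of them far below the model's 20,000-token limit.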

Actual Behavior

Despite max_chunk_size: 200 in the inference configuration, the request sent to Vertex AI appears to contain the entire document content, resulting in a 400 response because the input token count (75,239 in this case) exceeds the model's 20,000-token limit.
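
The chunking_settings are present in the endpoint configuration shown above, so the settings appear to be stored but no longer applied at ingest. A quick way to re-check from the client, assuming the inference.get helper available in recent elasticsearch-py versions:

# Confirm the endpoint still reports the chunking settings after the upgrade.
resp = es.inference.get(inference_id="google_embedding_004_inference")
print(resp["endpoints"][0]["chunking_settings"])
# Expected: {'strategy': 'word', 'max_chunk_size': 200, 'overlap': 50}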

Logs (if relevant)

No response
