
[ML] Add configurable batch size to GoogleVertexAI to avoid hitting token limit during chunked inference #137288

@dan-rubinstein

Description

The EmbeddingRequestChunker batches chunks into a single request to a downstream service, up to a specified per-service batch size (see batching code). GoogleVertexAI imposes two limits on request size: one on the number of inputs (previously 5, now 250) and one on the total number of tokens across all inputs (20k tokens); these limits are documented by Google. In 8.19/9.1 we were asked to increase the batch size for GoogleVertexAI from 5 to 250 to reflect their new limits (see relevant PR). This change caused one user to start seeing token limit exceptions when ingesting large documents: where they previously sent at most 5 chunks of 250 words (~1330 tokens, assuming 1 token = 0.75 words), they now send at most 250 chunks of 250 words (~66500 tokens), which exceeds the 20k token per-request limit.
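For illustration only, here is a minimal Java sketch of how grouping chunks purely by input count can overshoot a total-token limit. This is not the actual EmbeddingRequestChunker code; the class name, helper methods, and the word-to-token ratio are assumptions taken from the estimate above.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchingSketch {

    // Rough estimate used above: 1 token ~= 0.75 words, so tokens ~= words / 0.75.
    private static final double WORDS_PER_TOKEN = 0.75;

    /** Split chunks into batches of at most maxBatchSize inputs, ignoring token counts. */
    static List<List<String>> batch(List<String> chunks, int maxBatchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i += maxBatchSize) {
            batches.add(chunks.subList(i, Math.min(i + maxBatchSize, chunks.size())));
        }
        return batches;
    }

    /** Estimate the token count of a batch from its total word count. */
    static long estimatedTokens(List<String> batch) {
        long words = batch.stream().mapToLong(c -> c.split("\\s+").length).sum();
        return Math.round(words / WORDS_PER_TOKEN);
    }

    public static void main(String[] args) {
        // 300 chunks of ~250 words each, as chunking a large document might produce.
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            chunks.add("word ".repeat(250).trim());
        }
        // With a batch size of 250, the first batch carries ~62,500 words,
        // far above a 20k tokens-per-request limit under this estimate.
        for (List<String> batch : batch(chunks, 250)) {
            System.out.println("inputs=" + batch.size() + " estTokens=" + estimatedTokens(batch));
        }
    }
}
```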

The purpose of this issue is to add a way for users to configure the batch size through a setting on the inference endpoint (we will need to decide whether this is a service setting or a task setting). This would let users unblock their calls if they hit the token limit when ingesting a sufficiently large document into a semantic_text field.
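As a rough sketch of what resolving such a setting could look like, the snippet below reads an optional batch size from the endpoint's settings map and validates it against the service maximum. The setting name "batch_size", the defaults, and the validation bounds are all assumptions for discussion, not a final design.

```java
import java.util.Map;

final class ConfigurableBatchSizeSketch {

    // Assumed values: 250 is the current GoogleVertexAI maximum inputs per request.
    static final int DEFAULT_BATCH_SIZE = 250;
    static final int MAX_BATCH_SIZE = 250;

    /** Read an optional "batch_size" from the endpoint's (service or task) settings. */
    static int resolveBatchSize(Map<String, Object> settings) {
        Object value = settings.get("batch_size");
        if (value == null) {
            return DEFAULT_BATCH_SIZE;
        }
        int batchSize = ((Number) value).intValue();
        if (batchSize < 1 || batchSize > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                "batch_size must be between 1 and " + MAX_BATCH_SIZE + ", got " + batchSize);
        }
        return batchSize;
    }

    public static void main(String[] args) {
        // A user hitting the 20k token limit could lower the batch size on their endpoint.
        System.out.println(resolveBatchSize(Map.of("batch_size", 20))); // 20
        System.out.println(resolveBatchSize(Map.of()));                 // 250 (default)
    }
}
```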
