Description
The EmbeddingRequestChunker batches chunks into a single request to the downstream service, up to a per-service batch size (see batching code). GoogleVertexAI imposes two limits on request size: a limit on the number of inputs (previously 5, now 250) and a limit of 20k tokens across all inputs (these limits can be found here). In 8.19/9.1 we were asked to increase the batch size for GoogleVertexAI from 5 to 250 to reflect the new input limit (see relevant PR). After this change, one user started seeing token limit exceptions when ingesting large documents: where they previously sent at most 5 chunks of 250 words per request (~1330 tokens, assuming 1 token = 0.75 words), they now send up to 250 chunks of 250 words (~66,500 tokens), well above the 20k token limit.
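
As a back-of-the-envelope check of the numbers above (this is just the arithmetic from the description, not Elasticsearch code; the ~266 tokens per chunk is derived from the ~1330 tokens / 5 chunks estimate):

```java
// Rough check of the batch-size vs. token-budget numbers in the description.
public class VertexAiTokenBudget {
    public static void main(String[] args) {
        final int tokensPerChunk = 266;   // ~1330 tokens / 5 chunks of 250 words, per the estimate above
        final int tokenLimit = 20_000;    // GoogleVertexAI per-request token limit

        for (int batchSize : new int[] { 5, 250 }) {
            int estimatedTokens = batchSize * tokensPerChunk;
            System.out.printf(
                "batch size %3d -> ~%,6d tokens (%s the 20k limit)%n",
                batchSize,
                estimatedTokens,
                estimatedTokens <= tokenLimit ? "within" : "exceeds");
        }
    }
}
```

So the old batch size of 5 stays comfortably under the 20k token limit, while the new batch size of 250 can exceed it by more than 3x for documents that chunk into 250-word pieces.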
The purpose of this issue is to add a way for users to configure the batch size through a setting on the inference endpoint (we will need to decide whether this is a service setting or a task setting). This would let users who hit the token limit while ingesting large documents into a semantic_text field lower the batch size and unblock their ingestion.
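
As a rough sketch of the shape of the fix (this is not the actual EmbeddingRequestChunker implementation, and the setting name and where it lives are still to be decided), the batching layer would partition chunks using a user-configured maximum instead of a hard-coded per-service constant:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: partitions chunked inputs into request-sized batches using a
// user-supplied maximum, as the proposed endpoint setting would allow.
public final class ConfigurableBatcher {

    private ConfigurableBatcher() {}

    /**
     * Splits the chunked inputs into batches of at most {@code maxBatchSize} chunks.
     * In the proposal, {@code maxBatchSize} would come from the inference endpoint
     * setting rather than the per-service default (e.g. 250 for GoogleVertexAI).
     */
    public static List<List<String>> batch(List<String> chunks, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("max batch size must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < chunks.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, chunks.size());
            batches.add(new ArrayList<>(chunks.subList(start, end)));
        }
        return batches;
    }
}
```

A user hitting the 20k token limit could then set a smaller batch size (for example 50) on their GoogleVertexAI endpoint without affecting the defaults for other services.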