
Conversation


@davidkyle davidkyle commented Jul 30, 2025

The internal action is given an inference ID and returns the maximum number of words for a rerank request. Initially either 250 or 500 words is returned, but the logic can be enhanced and tailored to each inference service.

A new RerankingInferenceService interface is defined to expose the window size; all services that support rerank must implement this interface. To enforce this, all inference service unit tests now extend InferenceServiceTestCase, which checks that if a service supports the RERANK task type then it must also implement RerankingInferenceService.
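A minimal sketch of what such an interface might look like. Only the `rerankerWindowSize` method and the `CONSERVATIVE_DEFAULT_WINDOW_SIZE` constant are named in this PR; the constant's value of 300 is taken from the table below, and everything else here is illustrative, not the actual Elasticsearch source:

```java
// Illustrative sketch, not Elasticsearch code: the method name and constant
// name appear in this PR, the value 300 follows the summary table.
public interface RerankingInferenceService {
    // Conservative fallback for services whose model context window is unknown.
    int CONSERVATIVE_DEFAULT_WINDOW_SIZE = 300;

    // Maximum number of words to include in a single rerank request
    // for the given model.
    int rerankerWindowSize(String modelId);
}
```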

Summarising the window sizes implemented in this PR:

| Service | Model | Context Window Size (tokens) | Context Window Size (words, 0.75 words per token) | Rerank Window Size (words) |
|---|---|---|---|---|
| Alibaba | mGTE models | 8192 | 6144 | 5500 |
| Azure AI Studio | - | unknown | - | 300 |
| Cohere | any | 4096 | 3072 | 2800 |
| Custom Service | - | unknown | - | 300 |
| Elasticsearch | rerank | 512 | 384 | 300 |
| Google Vertex AI | -003 models | 512 | 384 | 300 |
| Google Vertex AI | -004 models | 1024 | 768 | 600 |
| Hugging Face | - | unknown | - | 300 |
| Jina AI | any | 8000 | 6000 | 5500 |
| SageMaker | - | unknown | - | 300 |
| Voyage AI | rerank-lite-1 | 4000 | 3000 | 2800 |
| Voyage AI | any other | 8000 | 6000 | 5500 |
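The token-to-word conversion behind the table's middle column can be sketched with a hypothetical helper (not code from this PR): approximate English text at 0.75 words per token, then choose a rerank window comfortably below that budget to avoid truncation.

```java
// Hypothetical helper illustrating the conversion used in the table above;
// the rerank window itself is then picked with a safety margin below this.
public final class RerankWindowMath {
    // Approximate English text at 0.75 words per token.
    static int approxWords(int contextWindowTokens) {
        return (int) (contextWindowTokens * 0.75);
    }

    public static void main(String[] args) {
        System.out.println(approxWords(8192)); // Alibaba mGTE: 6144 words
        System.out.println(approxWords(512));  // Elasticsearch rerank: 384 words
        System.out.println(approxWords(4096)); // Cohere: 3072 words
    }
}
```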

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

```java
@Override
public int rerankerWindowSize(String modelId) {
    // TODO rerank chunking should use the same value
    return RerankingInferenceService.CONSERVATIVE_DEFAULT_WINDOW_SIZE;
}
```
Member

Is this an accurate value for the elastic reranker? I believe it has a 512 max token count, which is ~683 words assuming 0.75 tokens/word for English text.

Member

Correct. Also, when I tested snippet extraction using the highlighter, the sweet spot was around 2560 characters. I worry this might be too low.

Member Author

At 0.75 words per token, 512 tokens equates to roughly 384 words. The conservative default of 250 is low, but it definitely avoids truncation. 300 words should be OK, if not higher.

@kderusso kderusso left a comment


Thanks for adding this API so quickly! I have some questions/concerns about the defaults.

```java
// Alibaba's mGTE models support long context windows of up to 8192 tokens.
// Using 1 token = 0.75 words, this translates to approximately 6144 words.
// https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base
return 5000;
```
Member

Why do we set this so much lower than the actual token size? Is it a safety concern?

Member Author

I picked low values that definitely wouldn't truncate, but yes, there is probably room to safely increase this value. 6000 is too close to the approximate 6144-word limit; how about 5500?

Ultimately we may want to make this option configurable, but for the best user experience something that works out of the box is required. As a next step, this setting should be exposed as part of the endpoint configuration so users can see what the rerank chunk sizes are.


@davidkyle davidkyle enabled auto-merge (squash) August 26, 2025 09:46
@davidkyle davidkyle merged commit 0b70308 into elastic:main Aug 26, 2025
33 checks passed

Labels

:ml/Chunking :ml Machine learning >refactoring Team:ML Meta label for the ML team v9.2.0
