Remove matched text from chunks #123607
the changes make sense to me. just a couple questions.
        
          
        
@elasticmachine update branch

Pinging @elastic/ml-core (Team:ML)

provided the serverless failures are not related, this looks ready to me!
LGTM
On second thought: I'm also fine with merging this, and I'll rebase my PR on top of this. It's a bit of effort, but that reduces the complexity of my PR.

assertThat(chunkedFloatResult.chunks().get(3).matchedText(), startsWith(" passage_input60 "));
assertThat(chunkedFloatResult.chunks().get(4).matchedText(), startsWith(" passage_input80 "));
assertThat(chunkedFloatResult.chunks().get(5).matchedText(), startsWith(" passage_input100 "));
assertThat(chunkedFloatResult.chunks().get(0).offset(), equalTo(new ChunkedInference.TextOffset(0, 309)));
I find it hard to verify that these are the correct assertions.
I'd prefer something like:
assertThat(getMatchedText(inputs.get(1), chunkedFloatResult.chunks().get(5).offset()), startsWith(" passage_input100 "));
together with a simple helper function
private static String getMatchedText(String text, ChunkedInference.TextOffset offset) {
    return text.substring(offset.start(), offset.end());
}
LGTM, just one small comment about the test assertions
buildkite test this

💔 Backport failed
You can use sqren/backport to manually backport by running

💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
    
(cherry picked from commit 2fa6651)
# Conflicts:
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/amazonbedrock/AmazonBedrockServiceTests.java
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureaistudio/AzureAiStudioServiceTests.java
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureopenai/AzureOpenAiServiceTests.java
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/cohere/CohereServiceTests.java
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/openai/OpenAiServiceTests.java
#   x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/voyageai/VoyageAIServiceTests.java
Remove the `matchedText` field from chunks. This is no longer required because we build the chunk text on demand using offsets. Removing this field should help with OOMs when chunking large documents.
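For readers unfamiliar with the approach, here is a minimal sketch of what "building the chunk text on demand using offsets" could look like. It is not the code from this PR: `OffsetChunks`, its `TextOffset` record, and `fixedSizeOffsets` are hypothetical stand-ins (the real type is `ChunkedInference.TextOffset`, and the fixed-size split is only an illustration, not the actual chunking strategy). The sketch assumes an exclusive end offset, as in `String.substring`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, self-contained illustration; not the Elasticsearch implementation.
final class OffsetChunks {

    // Stand-in for ChunkedInference.TextOffset; `end` is assumed to be exclusive.
    record TextOffset(int start, int end) {}

    // Recover a chunk's text only when it is needed, instead of storing a copy per chunk.
    static String matchedText(String input, TextOffset offset) {
        return input.substring(offset.start(), offset.end());
    }

    // Illustrative chunker: splits the input into fixed-size windows and records
    // only the offsets, so no chunk text is materialized up front.
    static List<TextOffset> fixedSizeOffsets(String input, int size) {
        List<TextOffset> offsets = new ArrayList<>();
        for (int start = 0; start < input.length(); start += size) {
            offsets.add(new TextOffset(start, Math.min(start + size, input.length())));
        }
        return offsets;
    }
}
```

Under these assumptions each chunk carries two integers rather than a copy of its substring, so the input text is held once regardless of how many chunks it produces. That is presumably where the memory saving for large documents comes from.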