Text similarity reranker chunks and scores snippets #133576
Merged: kderusso merged 49 commits into elastic:main from kderusso:kderusso/text-similarity-reranking-now-with-chunks on Sep 11, 2025.

Changes from all commits (49 commits):
- 79b7e72 (kderusso) Instead of generating snippets via highlighter, chunk and score chunk…
- 9f28c08 [CI] Auto commit changes from spotless
- 49d25a7 (kderusso) Add customization based on preferred chunking settings or chunk size …
- 0036271 (kderusso) Cleanup
- 2df2f9d (kderusso) Update docs/changelog/133576.yaml
- ad404db (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 8b7f7f2 (kderusso) Refactor/Rename SnippetScorer to MemoryIndexChunkScorer
- 80f4434 (kderusso) PR feedback on MemoryIndexChunkScorer
- 9872258 (kderusso) Update API and code to rename snippets to chunks
- 8c4ab1e (kderusso) Missed some snippet renames
- fc706a8 (kderusso) Handle case where no matches were found to score chunks
- 246dfa2 (kderusso) PR feedback on MemoryIndexChunkScorer, add tests
- 6355282 (kderusso) Rename num_chunks to size
- d03a0f1 (kderusso) Merge from main
- ed13074 [CI] Auto commit changes from spotless
- 6a06b84 (kderusso) Fix error in merge
- 3c695ac (kderusso) Fix transport version issues after they were consolidated in main
- 386f3b8 [CI] Auto commit changes from spotless
- 9d35f6c (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 7131b30 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 9c6041b (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- ed4859e (elasticmachine) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 8ac71ef (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 5129275 (elasticmachine) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 5815c9b (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- ee106a6 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- dfcefc5 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- a0cad00 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 92a060f (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- a172b6c (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- a6c2364 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- b3f95f9 (elasticmachine) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- e55ccfe (kderusso) Add feature flag to InferenceUpgradeTestCase
- 68ea8cb (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 68af14f (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 37eca54 (elasticmachine) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- b66a58e (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 7a4ccff (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- ca597fa (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- ef024cf (kderusso) Yolo see if this fixes the test
- c208845 (kderusso) Real fix for upgrade IT
- cc1e913 [CI] Auto commit changes from spotless
- 2651fbd (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- d0813b2 (kderusso) Another ignore
- d0c2139 (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- 187106f (kderusso) Revert "Another ignore"
- af95d57 (kderusso) let's try reverting the renamed feature flag. If this is the cause of…
- 6f8b5fe (kderusso) Merge branch 'main' into kderusso/text-similarity-reranking-now-with-…
- d0dd688 (kderusso) Remove ignored test
docs/changelog/133576.yaml (5 additions, 0 deletions):

```yaml
pr: 133576
summary: Text similarity reranker chunks and scores snippets
area: Relevance
type: enhancement
issues: []
```
...core/src/main/java/org/elasticsearch/xpack/core/common/chunks/MemoryIndexChunkScorer.java (98 additions, 0 deletions):

```java
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

package org.elasticsearch.xpack.core.common.chunks;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.QueryBuilder;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Utility class for scoring pre-determined chunks using an in-memory Lucene index.
 */
public class MemoryIndexChunkScorer {

    private static final String CONTENT_FIELD = "content";

    private final StandardAnalyzer analyzer;

    public MemoryIndexChunkScorer() {
        // TODO: Allow analyzer to be customizable and/or read from the field mapping
        this.analyzer = new StandardAnalyzer();
    }

    /**
     * Creates an in-memory index of the chunks and returns an ordered, scored list.
     *
     * @param chunks the list of text chunks to score
     * @param inferenceText the query text to compare against
     * @param maxResults maximum number of results to return
     * @return list of scored chunks ordered by relevance
     * @throws IOException on failure scoring chunks
     */
    public List<ScoredChunk> scoreChunks(List<String> chunks, String inferenceText, int maxResults) throws IOException {
        if (chunks == null || chunks.isEmpty() || inferenceText == null || inferenceText.trim().isEmpty()) {
            return new ArrayList<>();
        }

        try (Directory directory = new ByteBuffersDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            try (IndexWriter writer = new IndexWriter(directory, config)) {
                for (String chunk : chunks) {
                    Document doc = new Document();
                    doc.add(new TextField(CONTENT_FIELD, chunk, Field.Store.YES));
                    writer.addDocument(doc);
                }
                writer.commit();
            }

            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);

                QueryBuilder qb = new QueryBuilder(analyzer);
                Query query = qb.createBooleanQuery(CONTENT_FIELD, inferenceText, BooleanClause.Occur.SHOULD);
                int numResults = Math.min(maxResults, chunks.size());
                TopDocs topDocs = searcher.search(query, numResults);

                List<ScoredChunk> scoredChunks = new ArrayList<>();
                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    Document doc = reader.storedFields().document(scoreDoc.doc);
                    String content = doc.get(CONTENT_FIELD);
                    scoredChunks.add(new ScoredChunk(content, scoreDoc.score));
                }

                // It's possible that no chunks were scorable (for example, a semantic match that does not have a lexical match).
                // In this case, we'll return the first N chunks with a score of 0.
                // TODO: consider parameterizing this
                return scoredChunks.isEmpty() == false
                    ? scoredChunks
                    : chunks.subList(0, Math.min(maxResults, chunks.size())).stream().map(c -> new ScoredChunk(c, 0.0f)).toList();
            }
        }
    }

    /**
     * Represents a chunk with its relevance score.
     */
    public record ScoredChunk(String content, float score) {}
}
```
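The core flow of the class above (score every chunk against the query, sort descending, and fall back to the first N chunks at score 0 when nothing matches lexically) can be sketched without Lucene. This is a simplified illustration, not part of the PR: `ChunkScorerSketch` and `naiveScore` are hypothetical names, and the term-overlap scoring stands in for the BM25 scoring the real class delegates to Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ChunkScorerSketch {
    record ScoredChunk(String content, float score) {}

    // Hypothetical stand-in for Lucene scoring: counts query terms present in the chunk.
    static float naiveScore(String chunk, String query) {
        float score = 0f;
        String lower = chunk.toLowerCase();
        for (String term : query.toLowerCase().split("\\s+")) {
            if (lower.contains(term)) {
                score += 1f;
            }
        }
        return score;
    }

    static List<ScoredChunk> scoreChunks(List<String> chunks, String query, int maxResults) {
        List<ScoredChunk> scored = new ArrayList<>();
        for (String chunk : chunks) {
            float s = naiveScore(chunk, query);
            if (s > 0f) {
                scored.add(new ScoredChunk(chunk, s));
            }
        }
        scored.sort(Comparator.comparingDouble(ScoredChunk::score).reversed());
        if (scored.isEmpty() == false) {
            return scored.subList(0, Math.min(maxResults, scored.size()));
        }
        // No scorable chunks: fall back to the first N chunks with a score of 0,
        // mirroring the fallback in MemoryIndexChunkScorer.
        return chunks.subList(0, Math.min(maxResults, chunks.size()))
            .stream()
            .map(c -> new ScoredChunk(c, 0f))
            .toList();
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("dogs play fetch", "cats nap", "fish swim");
        // Lexical match: the matching chunk wins.
        System.out.println(scoreChunks(chunks, "cats nap", 2).get(0).content());
        // No lexical match: fallback returns the first chunks at score 0.
        System.out.println(scoreChunks(chunks, "puggles", 2).get(0).content());
    }
}
```

The fallback keeps the reranker total-order stable even when the inference text shares no terms with any chunk, at the cost of returning arbitrary (first-N) chunks in that case, which the real code flags with a TODO.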
...src/test/java/org/elasticsearch/xpack/core/common/chunks/MemoryIndexChunkScorerTests.java (95 additions, 0 deletions):

```java
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

package org.elasticsearch.xpack.core.common.chunks;

import org.elasticsearch.test.ESTestCase;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.greaterThan;

public class MemoryIndexChunkScorerTests extends ESTestCase {

    private static final List<String> CHUNKS = Arrays.asList(
        "Cats like to sleep all day and play with mice",
        "Dogs are loyal companions and great pets",
        "The weather today is very sunny and warm",
        "Dogs love to play with toys and go for walks",
        "Elasticsearch is a great search engine"
    );

    public void testScoreChunks() throws IOException {
        MemoryIndexChunkScorer scorer = new MemoryIndexChunkScorer();

        String inferenceText = "dogs play walk";
        int maxResults = 3;

        List<MemoryIndexChunkScorer.ScoredChunk> scoredChunks = scorer.scoreChunks(CHUNKS, inferenceText, maxResults);

        assertEquals(maxResults, scoredChunks.size());

        // The chunks about dogs should score highest, followed by the chunk about cats
        MemoryIndexChunkScorer.ScoredChunk chunk = scoredChunks.getFirst();
        assertTrue(chunk.content().equalsIgnoreCase("Dogs love to play with toys and go for walks"));
        assertThat(chunk.score(), greaterThan(0f));

        chunk = scoredChunks.get(1);
        assertTrue(chunk.content().equalsIgnoreCase("Dogs are loyal companions and great pets"));
        assertThat(chunk.score(), greaterThan(0f));

        chunk = scoredChunks.get(2);
        assertTrue(chunk.content().equalsIgnoreCase("Cats like to sleep all day and play with mice"));
        assertThat(chunk.score(), greaterThan(0f));

        // Scores should be in descending order
        for (int i = 1; i < scoredChunks.size(); i++) {
            assertTrue(scoredChunks.get(i - 1).score() >= scoredChunks.get(i).score());
        }
    }

    public void testEmptyChunks() throws IOException {

        int maxResults = 3;

        MemoryIndexChunkScorer scorer = new MemoryIndexChunkScorer();

        // Query with zero lexical matches
        List<MemoryIndexChunkScorer.ScoredChunk> scoredChunks = scorer.scoreChunks(CHUNKS, "puggles", maxResults);
        assertEquals(maxResults, scoredChunks.size());

        // There were no results so we return the first N chunks in order
        MemoryIndexChunkScorer.ScoredChunk chunk = scoredChunks.getFirst();
        assertTrue(chunk.content().equalsIgnoreCase("Cats like to sleep all day and play with mice"));
        assertThat(chunk.score(), equalTo(0f));

        chunk = scoredChunks.get(1);
        assertTrue(chunk.content().equalsIgnoreCase("Dogs are loyal companions and great pets"));
        assertThat(chunk.score(), equalTo(0f));

        chunk = scoredChunks.get(2);
        assertTrue(chunk.content().equalsIgnoreCase("The weather today is very sunny and warm"));
        assertThat(chunk.score(), equalTo(0f));

        // Null and empty chunk input
        scoredChunks = scorer.scoreChunks(List.of(), "puggles", maxResults);
        assertTrue(scoredChunks.isEmpty());

        scoredChunks = scorer.scoreChunks(CHUNKS, "", maxResults);
        assertTrue(scoredChunks.isEmpty());

        scoredChunks = scorer.scoreChunks(null, "puggles", maxResults);
        assertTrue(scoredChunks.isEmpty());

        scoredChunks = scorer.scoreChunks(CHUNKS, null, maxResults);
        assertTrue(scoredChunks.isEmpty());
    }

}
```
...rc/main/java/org/elasticsearch/xpack/inference/rank/textsimilarity/ChunkScorerConfig.java (100 additions, 0 deletions):

```java
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

package org.elasticsearch.xpack.inference.rank.textsimilarity;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;
import org.elasticsearch.inference.ChunkingSettings;
import org.elasticsearch.xpack.inference.chunking.ChunkingSettingsBuilder;
import org.elasticsearch.xpack.inference.chunking.SentenceBoundaryChunkingSettings;

import java.io.IOException;
import java.util.Map;
import java.util.Objects;

public class ChunkScorerConfig implements Writeable {

    public final Integer size;
    private final String inferenceText;
    private final ChunkingSettings chunkingSettings;

    public static final int DEFAULT_CHUNK_SIZE = 300;
    public static final int DEFAULT_SIZE = 1;

    public static ChunkingSettings createChunkingSettings(Integer chunkSize) {
        int chunkSizeOrDefault = chunkSize != null ? chunkSize : DEFAULT_CHUNK_SIZE;
        ChunkingSettings chunkingSettings = new SentenceBoundaryChunkingSettings(chunkSizeOrDefault, 0);
        chunkingSettings.validate();
        return chunkingSettings;
    }

    public static ChunkingSettings chunkingSettingsFromMap(Map<String, Object> map) {

        if (map == null || map.isEmpty()) {
            return createChunkingSettings(DEFAULT_CHUNK_SIZE);
        }

        if (map.size() == 1 && map.containsKey("max_chunk_size")) {
            return createChunkingSettings((Integer) map.get("max_chunk_size"));
        }

        return ChunkingSettingsBuilder.fromMap(map);
    }

    public ChunkScorerConfig(StreamInput in) throws IOException {
        this.size = in.readOptionalVInt();
        this.inferenceText = in.readString();
        Map<String, Object> chunkingSettingsMap = in.readGenericMap();
        this.chunkingSettings = ChunkingSettingsBuilder.fromMap(chunkingSettingsMap);
    }

    public ChunkScorerConfig(Integer size, ChunkingSettings chunkingSettings) {
        this(size, null, chunkingSettings);
    }

    public ChunkScorerConfig(Integer size, String inferenceText, ChunkingSettings chunkingSettings) {
        this.size = size;
        this.inferenceText = inferenceText;
        this.chunkingSettings = chunkingSettings;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeOptionalVInt(size);
        out.writeString(inferenceText);
        out.writeGenericMap(chunkingSettings.asMap());
    }

    public Integer size() {
        return size;
    }

    public String inferenceText() {
        return inferenceText;
    }

    public ChunkingSettings chunkingSettings() {
        return chunkingSettings;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        ChunkScorerConfig that = (ChunkScorerConfig) o;
        return Objects.equals(size, that.size)
            && Objects.equals(inferenceText, that.inferenceText)
            && Objects.equals(chunkingSettings, that.chunkingSettings);
    }

    @Override
    public int hashCode() {
        return Objects.hash(size, inferenceText, chunkingSettings);
    }
}
```
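The dispatch in `chunkingSettingsFromMap` above has three branches: a null or empty map falls back to sentence-boundary chunking with the default size, a map containing only `max_chunk_size` is a shortcut for sentence-boundary chunking at that size, and anything else goes through the full `ChunkingSettingsBuilder`. A minimal stand-alone sketch of that branching, where `SentenceSettings` is a hypothetical placeholder for `SentenceBoundaryChunkingSettings` and the builder branch is not modeled:

```java
import java.util.Map;

public class ChunkSettingsSketch {
    static final int DEFAULT_CHUNK_SIZE = 300;

    // Hypothetical placeholder for SentenceBoundaryChunkingSettings(maxChunkSize, sentenceOverlap).
    record SentenceSettings(int maxChunkSize, int sentenceOverlap) {}

    static SentenceSettings fromMap(Map<String, Object> map) {
        if (map == null || map.isEmpty()) {
            // No settings given: sentence-boundary chunking with the default chunk size.
            return new SentenceSettings(DEFAULT_CHUNK_SIZE, 0);
        }
        if (map.size() == 1 && map.containsKey("max_chunk_size")) {
            // Shortcut: a bare max_chunk_size selects sentence-boundary chunking at that size.
            return new SentenceSettings((Integer) map.get("max_chunk_size"), 0);
        }
        // The real code hands any other map to ChunkingSettingsBuilder.fromMap for full parsing.
        throw new IllegalArgumentException("full settings parsing not modeled in this sketch");
    }

    public static void main(String[] args) {
        System.out.println(fromMap(null).maxChunkSize());
        System.out.println(fromMap(Map.of("max_chunk_size", 120)).maxChunkSize());
    }
}
```

This mirrors the PR's design choice of letting callers pass either a single convenience parameter or a complete chunking-settings map without ambiguity, since the shortcut only fires when `max_chunk_size` is the sole key.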