Commit 3471987
Add support for retrieving semantic_text's indexed chunks via fields API (#132410)
Introduces the `"format": "chunks"` option for the `fields` parameter in `_search` requests. Allows users to retrieve the original text chunks generated by a semantic field's chunking strategy. Example usage:

```
POST test-index/_search
{
  "query": {
    "ids": {
      "values": ["1"]
    }
  },
  "fields": [
    {
      "field": "semantic_text_field",
      "format": "chunks" <1>
    }
  ]
}
```
1 parent 0d7a2cc commit 3471987

File tree

6 files changed: +219 −45 lines changed

docs/changelog/132410.yaml

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+pr: 132410
+summary: Add support for retrieving semantic_text's indexed chunks via fields API
+area: Vector Search
+type: feature
+issues: []

docs/reference/elasticsearch/mapping-reference/semantic-text.md

Lines changed: 53 additions & 21 deletions

@@ -282,6 +282,34 @@ PUT test-index/_doc/1
 * Others (such as `elastic` and `elasticsearch`) will automatically truncate
 the input.

+## Retrieving indexed chunks
+```{applies_to}
+stack: ga 9.2
+serverless: ga
+```
+
+You can retrieve the individual chunks generated by your semantic field’s chunking
+strategy using the [fields parameter](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#search-fields-param):
+
+```console
+POST test-index/_search
+{
+  "query": {
+    "ids" : {
+      "values" : ["1"]
+    }
+  },
+  "fields": [
+    {
+      "field": "semantic_text_field",
+      "format": "chunks" <1>
+    }
+  ]
+}
+```
+
+1. Use `"format": "chunks"` to return the field’s text as the original text chunks that were indexed.
+
 ## Extracting relevant fragments from semantic text [semantic-text-highlighting]

 You can extract the most relevant fragments from a semantic text field by using
@@ -311,27 +339,6 @@ POST test-index/_search
 2. Sorts the most relevant highlighted fragments by score when set to `score`. By default,
 fragments will be output in the order they appear in the field (order: none).

-To use the `semantic` highlighter to view chunks in the order which they were indexed with no scoring,
-use the `match_all` query to retrieve them in the order they appear in the document:
-
-```console
-POST test-index/_search
-{
-  "query": {
-    "match_all": {}
-  },
-  "highlight": {
-    "fields": {
-      "my_semantic_field": {
-        "number_of_fragments": 5 <1>
-      }
-    }
-  }
-}
-```
-
-1. This will return the first 5 chunks, set this number higher to retrieve more chunks.
-
 Highlighting is supported on fields other than semantic_text. However, if you
 want to restrict highlighting to the semantic highlighter and return no
 fragments when the field is not of type semantic_text, you can explicitly
@@ -359,6 +366,31 @@ PUT test-index

 1. Ensures that highlighting is applied exclusively to semantic_text fields.

+To retrieve all fragments from the `semantic` highlighter in their original indexing order
+without scoring, use a `match_all` query as the `highlight_query`.
+This ensures fragments are returned in the order they appear in the document:
+
+```console
+POST test-index/_search
+{
+  "query": {
+    "ids": {
+      "values": ["1"]
+    }
+  },
+  "highlight": {
+    "fields": {
+      "my_semantic_field": {
+        "number_of_fragments": 5, <1>
+        "highlight_query": { "match_all": {} }
+      }
+    }
+  }
+}
+```
+
+1. Returns the first 5 fragments. Increase this value to retrieve additional fragments.
+
 ## Updates and partial updates for `semantic_text` fields [semantic-text-updates]

 When updating documents that contain `semantic_text` fields, it’s important to understand how inference is triggered:
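Conceptually, the `"format": "chunks"` option re-derives each chunk from the original source text using stored start/end offsets rather than duplicating the chunk text in the index. A minimal Python sketch of that idea (the helper name and offsets here are illustrative, not Elasticsearch internals):

```python
# Illustrative sketch: chunks are stored as (start, end) offsets into the
# original field text, and retrieval slices the source string back out.
def chunks_from_offsets(source_text: str, offsets: list[tuple[int, int]]) -> list[str]:
    """Return the indexed chunks of source_text given offset ranges."""
    return [source_text[start:end] for start, end in offsets]

text = "Elasticsearch stores chunk offsets. Retrieval slices the source."
offsets = [(0, 35), (36, 65)]
print(chunks_from_offsets(text, offsets))
```

Because only offsets are stored, the response always reflects the exact text that was chunked at index time.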

x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java

Lines changed: 3 additions & 1 deletion

@@ -47,6 +47,7 @@ public class InferenceFeatures implements FeatureSpecification {
     private static final NodeFeature SEMANTIC_TEXT_MATCH_ALL_HIGHLIGHTER = new NodeFeature("semantic_text.match_all_highlighter");
     private static final NodeFeature COHERE_V2_API = new NodeFeature("inference.cohere.v2");
     public static final NodeFeature SEMANTIC_TEXT_HIGHLIGHTING_FLAT = new NodeFeature("semantic_text.highlighter.flat_index_options");
+    private static final NodeFeature SEMANTIC_TEXT_FIELDS_CHUNKS_FORMAT = new NodeFeature("semantic_text.fields_chunks_format");

     @Override
     public Set<NodeFeature> getTestFeatures() {
@@ -80,7 +81,8 @@ public Set<NodeFeature> getTestFeatures() {
                 SEMANTIC_TEXT_INDEX_OPTIONS_WITH_DEFAULTS,
                 SEMANTIC_QUERY_REWRITE_INTERCEPTORS_PROPAGATE_BOOST_AND_QUERY_NAME_FIX,
                 SEMANTIC_TEXT_HIGHLIGHTING_FLAT,
-                SEMANTIC_TEXT_SPARSE_VECTOR_INDEX_OPTIONS
+                SEMANTIC_TEXT_SPARSE_VECTOR_INDEX_OPTIONS,
+                SEMANTIC_TEXT_FIELDS_CHUNKS_FORMAT
             )
         );
         if (RERANK_SNIPPETS.isEnabled()) {

x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticInferenceMetadataFieldsMapper.java

Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ public ValueFetcher valueFetcher(MappingLookup mappingLookup, Function<Query, Bi
         for (var inferenceField : mappingLookup.inferenceFields().keySet()) {
             MappedFieldType ft = mappingLookup.getFieldType(inferenceField);
             if (ft instanceof SemanticTextFieldMapper.SemanticTextFieldType semanticTextFieldType) {
-                fieldFetchers.put(inferenceField, semanticTextFieldType.valueFetcherWithInferenceResults(bitSetCache, searcher));
+                fieldFetchers.put(inferenceField, semanticTextFieldType.valueFetcherWithInferenceResults(bitSetCache, searcher, false));
             } else {
                 throw new IllegalArgumentException(
                     "Invalid inference field [" + ft.name() + "]. Expected field type [semantic_text] but got [" + ft.typeName() + "]"

x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java

Lines changed: 84 additions & 22 deletions

@@ -89,6 +89,7 @@
 import java.io.UncheckedIOException;
 import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.HashMap;
 import java.util.Iterator;
 import java.util.LinkedHashMap;
 import java.util.List;
@@ -104,6 +105,7 @@
 import static org.elasticsearch.index.IndexVersions.SEMANTIC_TEXT_DEFAULTS_TO_BBQ_BACKPORT_8_X;
 import static org.elasticsearch.inference.TaskType.SPARSE_EMBEDDING;
 import static org.elasticsearch.inference.TaskType.TEXT_EMBEDDING;
+import static org.elasticsearch.lucene.search.uhighlight.CustomUnifiedHighlighter.MULTIVAL_SEP_CHAR;
 import static org.elasticsearch.search.SearchService.DEFAULT_SIZE;
 import static org.elasticsearch.xpack.inference.mapper.SemanticTextField.CHUNKED_EMBEDDINGS_FIELD;
 import static org.elasticsearch.xpack.inference.mapper.SemanticTextField.CHUNKED_OFFSET_FIELD;
@@ -864,14 +866,26 @@ public Query existsQuery(SearchExecutionContext context) {

     @Override
     public ValueFetcher valueFetcher(SearchExecutionContext context, String format) {
+        if (format != null && "chunks".equals(format) == false) {
+            throw new IllegalArgumentException(
+                "Unknown format [" + format + "] for field [" + name() + "], only [chunks] is supported."
+            );
+        }
+        if (format != null) {
+            return valueFetcherWithInferenceResults(getChunksField().bitsetProducer(), context.searcher(), true);
+        }
         if (useLegacyFormat) {
             // Redirect the fetcher to load the original values of the field
             return SourceValueFetcher.toString(getOriginalTextFieldName(name()), context, format);
         }
         return SourceValueFetcher.toString(name(), context, null);
     }

-    ValueFetcher valueFetcherWithInferenceResults(Function<Query, BitSetProducer> bitSetCache, IndexSearcher searcher) {
+    ValueFetcher valueFetcherWithInferenceResults(
+        Function<Query, BitSetProducer> bitSetCache,
+        IndexSearcher searcher,
+        boolean onlyTextChunks
+    ) {
         var embeddingsField = getEmbeddingsField();
         if (embeddingsField == null) {
             return ValueFetcher.EMPTY;
@@ -884,7 +898,7 @@ ValueFetcher valueFetcherWithInferenceResults(Function<Query, BitSetProducer> bi
                 org.apache.lucene.search.ScoreMode.COMPLETE_NO_SCORES,
                 1
             );
-            return new SemanticTextFieldValueFetcher(bitSetFilter, childWeight, embeddingsLoader);
+            return new SemanticTextFieldValueFetcher(bitSetFilter, childWeight, embeddingsLoader, onlyTextChunks);
         } catch (IOException exc) {
             throw new UncheckedIOException(exc);
         }
@@ -1022,6 +1036,7 @@ private class SemanticTextFieldValueFetcher implements ValueFetcher {
         private final BitSetProducer parentBitSetProducer;
         private final Weight childWeight;
         private final SourceLoader.SyntheticFieldLoader fieldLoader;
+        private final boolean onlyTextChunks;

         private BitSet bitSet;
         private Scorer childScorer;
@@ -1031,11 +1046,13 @@ private class SemanticTextFieldValueFetcher implements ValueFetcher {
         private SemanticTextFieldValueFetcher(
             BitSetProducer bitSetProducer,
             Weight childWeight,
-            SourceLoader.SyntheticFieldLoader fieldLoader
+            SourceLoader.SyntheticFieldLoader fieldLoader,
+            boolean onlyTextChunks
         ) {
             this.parentBitSetProducer = bitSetProducer;
             this.childWeight = childWeight;
             this.fieldLoader = fieldLoader;
+            this.onlyTextChunks = onlyTextChunks;
         }

         @Override
@@ -1046,7 +1063,9 @@ public void setNextReader(LeafReaderContext context) {
             if (childScorer != null) {
                 childScorer.iterator().nextDoc();
             }
-            dvLoader = fieldLoader.docValuesLoader(context.reader(), null);
+            if (onlyTextChunks == false) {
+                dvLoader = fieldLoader.docValuesLoader(context.reader(), null);
+            }
             var terms = context.reader().terms(getOffsetsFieldName(name()));
             offsetsLoader = terms != null ? OffsetSourceField.loader(terms) : null;
         } catch (IOException exc) {
@@ -1064,35 +1083,46 @@ public List<Object> fetchValues(Source source, int doc, List<Object> ignoredValu
             if (it.docID() < previousParent) {
                 it.advance(previousParent);
             }
+
+            return onlyTextChunks ? fetchTextChunks(source, doc, it) : fetchFullField(source, doc, it);
+        }
+
+        private List<Object> fetchTextChunks(Source source, int doc, DocIdSetIterator it) throws IOException {
+            Map<String, String> originalValueMap = new HashMap<>();
+            List<Object> chunks = new ArrayList<>();
+
+            iterateChildDocs(doc, it, offset -> {
+                var rawValue = originalValueMap.computeIfAbsent(offset.field(), k -> {
+                    var valueObj = XContentMapValues.extractValue(offset.field(), source.source(), null);
+                    var values = SemanticTextUtils.nodeStringValues(offset.field(), valueObj).stream().toList();
+                    return Strings.collectionToDelimitedString(values, String.valueOf(MULTIVAL_SEP_CHAR));
+                });
+
+                chunks.add(rawValue.substring(offset.start(), offset.end()));
+            });
+
+            return chunks;
+        }
+
+        private List<Object> fetchFullField(Source source, int doc, DocIdSetIterator it) throws IOException {
             Map<String, List<SemanticTextField.Chunk>> chunkMap = new LinkedHashMap<>();
-            while (it.docID() < doc) {
-                if (dvLoader == null || dvLoader.advanceToDoc(it.docID()) == false) {
-                    throw new IllegalStateException(
-                        "Cannot fetch values for field [" + name() + "], missing embeddings for doc [" + doc + "]"
-                    );
-                }
-                var offset = offsetsLoader.advanceTo(it.docID());
-                if (offset == null) {
-                    throw new IllegalStateException(
-                        "Cannot fetch values for field [" + name() + "], missing offsets for doc [" + doc + "]"
-                    );
-                }
-                var chunks = chunkMap.computeIfAbsent(offset.field(), k -> new ArrayList<>());
-                chunks.add(
+
+            iterateChildDocs(doc, it, offset -> {
+                var fullChunks = chunkMap.computeIfAbsent(offset.field(), k -> new ArrayList<>());
+                fullChunks.add(
                     new SemanticTextField.Chunk(
                         null,
                         offset.start(),
                         offset.end(),
                         rawEmbeddings(fieldLoader::write, source.sourceContentType())
                     )
                 );
-                if (it.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
-                    break;
-                }
-            }
+            });
+
             if (chunkMap.isEmpty()) {
                 return List.of();
             }
+
             return List.of(
                 new SemanticTextField(
                     useLegacyFormat,
@@ -1104,6 +1134,38 @@ public List<Object> fetchValues(Source source, int doc, List<Object> ignoredValu
             );
         }

+        /**
+         * Iterates over all child documents for the given doc and applies the provided action for each valid offset.
+         */
+        private void iterateChildDocs(
+            int doc,
+            DocIdSetIterator it,
+            CheckedConsumer<OffsetSourceFieldMapper.OffsetSource, IOException> action
+        ) throws IOException {
+            while (it.docID() < doc) {
+                if (onlyTextChunks == false) {
+                    if (dvLoader == null || dvLoader.advanceToDoc(it.docID()) == false) {
+                        throw new IllegalStateException(
+                            "Cannot fetch values for field [" + name() + "], missing embeddings for doc [" + doc + "]"
+                        );
+                    }
+                }
+
+                var offset = offsetsLoader.advanceTo(it.docID());
+                if (offset == null) {
+                    throw new IllegalStateException(
+                        "Cannot fetch values for field [" + name() + "], missing offsets for doc [" + doc + "]"
+                    );
+                }
+
+                action.accept(offset);
+
+                if (it.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
+                    break;
+                }
+            }
+        }
+
         private BytesReference rawEmbeddings(CheckedConsumer<XContentBuilder, IOException> writer, XContentType xContentType)
             throws IOException {
             try (var result = XContentFactory.contentBuilder(xContentType)) {
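The `fetchTextChunks` path above joins all values of a multi-valued field with a separator character (`MULTIVAL_SEP_CHAR`), so a single `(start, end)` offset pair can address a chunk in any of the values. A hedged Python sketch of that join-then-slice behavior (the separator value and function names are assumptions for illustration):

```python
# Sketch of multi-value handling: field values are joined with a separator
# character, offsets are recorded against the joined string, and each chunk
# is recovered as a substring of that joined string.
SEP = "\u0001"  # stand-in for a multi-value separator like MULTIVAL_SEP_CHAR

def join_values(values: list[str]) -> str:
    """Join field values the way the fetcher builds its raw value string."""
    return SEP.join(values)

def fetch_text_chunks(values: list[str], offsets: list[tuple[int, int]]) -> list[str]:
    raw = join_values(values)
    return [raw[start:end] for start, end in offsets]

values = ["some test data", "now with chunks"]
# Offsets computed against the joined string "some test data\u0001now with chunks"
offsets = [(0, 14), (15, 30)]
print(fetch_text_chunks(values, offsets))
```

Caching the joined string per source field (as `originalValueMap` does in the Java code) avoids re-extracting and re-joining the values for every chunk.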

x-pack/plugin/inference/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml

Lines changed: 73 additions & 0 deletions

@@ -671,3 +671,76 @@ setup:
   - length: { hits.hits.0.highlight.bbq_hnsw_field: 1 }
   - match: { hits.hits.0.highlight.bbq_hnsw_field.0: "ElasticSearch is an open source, distributed, RESTful, search engine which is built on top of Lucene internally and enjoys all the features it provides." }

+---
+"Retrieve chunks with the fields api":
+  - requires:
+      cluster_features: "semantic_text.fields_chunks_format"
+      reason: semantic text field supports retrieving chunks through fields API in 9.2.0.
+
+  - do:
+      indices.create:
+        index: test-index-sparse
+        body:
+          settings:
+            index.mapping.semantic_text.use_legacy_format: false
+          mappings:
+            properties:
+              semantic_text_field:
+                type: semantic_text
+                inference_id: sparse-inference-id
+              text_field:
+                type: text
+                copy_to: ["semantic_text_field"]
+
+  - do:
+      index:
+        index: test-index-sparse
+        id: doc_1
+        body:
+          semantic_text_field: [ "some test data", " ", "now with chunks" ]
+          text_field: "text field data"
+        refresh: true
+
+  - do:
+      search:
+        index: test-index-sparse
+        body:
+          query:
+            match_all: { }
+          fields: [{"field": "semantic_text_field", "format": "chunks"}]
+
+  - match: { hits.total.value: 1 }
+  - match: { hits.hits.0._id: "doc_1" }
+  - length: { hits.hits.0.fields.semantic_text_field: 3 }
+  - match: { hits.hits.0.fields.semantic_text_field.0: "some test data" }
+  - match: { hits.hits.0.fields.semantic_text_field.1: "now with chunks" }
+  - match: { hits.hits.0.fields.semantic_text_field.2: "text field data" }
+
+---
+"Highlighting with match_all in a highlight_query":
+  - requires:
+      cluster_features: "semantic_text.match_all_highlighter"
+      reason: semantic text field supports match_all query with semantic highlighter, effective from 8.19 and 9.1.0.
+
+  - do:
+      search:
+        index: test-sparse-index
+        body:
+          query:
+            ids: {
+              values: ["doc_1"]
+            }
+          highlight:
+            fields:
+              body:
+                type: "semantic"
+                number_of_fragments: 2
+                highlight_query: {
+                  match_all: {}
+                }
+
+  - match: { hits.total.value: 1 }
+  - match: { hits.hits.0._id: "doc_1" }
+  - length: { hits.hits.0.highlight.body: 2 }
+  - match: { hits.hits.0.highlight.body.0: "ElasticSearch is an open source, distributed, RESTful, search engine which is built on top of Lucene internally and enjoys all the features it provides." }
+  - match: { hits.hits.0.highlight.body.1: "You Know, for Search!" }
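The first test above expects exactly three chunks from the inputs `["some test data", " ", "now with chunks"]` plus the `copy_to` value `"text field data"`: the whitespace-only value contributes no chunk. A sketch of that expectation (the blank-value filtering rule is an assumption made for illustration, not the documented chunking strategy):

```python
# Illustrative model of the test's expectation: each non-blank input value,
# including values copied in via copy_to, yields one indexed chunk.
def expected_chunks(semantic_values: list[str], copied_values: list[str]) -> list[str]:
    return [v for v in semantic_values + copied_values if v.strip()]

chunks = expected_chunks(["some test data", " ", "now with chunks"], ["text field data"])
print(chunks)
```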
