Merged

72 commits
c5580da
Synthetic source doc values arrays encoding experiment
martijnvg Sep 29, 2024
acf4d09
spotless
martijnvg Sep 30, 2024
dca77d7
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Sep 30, 2024
49efe26
iter
martijnvg Sep 30, 2024
f5e3d5a
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Sep 30, 2024
59010c3
iter
martijnvg Sep 30, 2024
a5198ae
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Oct 2, 2024
9b4aa5f
spotless
martijnvg Oct 2, 2024
12d30c5
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Oct 3, 2024
2ae8d83
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Oct 8, 2024
fc0e627
parsesArrayValue approach
martijnvg Oct 8, 2024
a111f94
fix multi fields
martijnvg Oct 9, 2024
14c2ddd
iter
martijnvg Oct 9, 2024
ba9e513
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Oct 9, 2024
007afd3
do not handle copy_to for now
martijnvg Oct 9, 2024
194b4ca
cleanup
martijnvg Oct 9, 2024
52c0db4
move ValueXContentParser
martijnvg Oct 9, 2024
8cc5b46
adjust expected json element type
martijnvg Oct 9, 2024
dc9db8a
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Oct 13, 2024
6e03aca
iter
martijnvg Oct 13, 2024
674f03e
fixed mistake
martijnvg Oct 14, 2024
acfaa55
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 16, 2025
0d90234
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 24, 2025
259d212
iter
martijnvg Jan 24, 2025
bf9ed2f
iter
martijnvg Jan 24, 2025
7bd3a15
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 24, 2025
d8e48c5
Moved processing of offsets to DocumentParser instead of in fieldmapp…
martijnvg Jan 24, 2025
ae1ce9f
[CI] Auto commit changes from spotless
Jan 24, 2025
5e610ef
* Added support for json null by encoding it as \0
martijnvg Jan 26, 2025
8f163eb
Encode offsets using zigzag encoding, so that -1 can be used to encod…
martijnvg Jan 26, 2025
43a1375
fixed codec issue if only NULLs gets indexed
martijnvg Jan 27, 2025
cf2b9a3
Update docs/changelog/113757.yaml
martijnvg Jan 27, 2025
893f555
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 27, 2025
61fd132
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 27, 2025
a428f11
handle string arrays for keyword fields nested under object arrays co…
martijnvg Jan 27, 2025
962ac8a
offsetsFieldMapper -> offsetsFieldName
martijnvg Jan 27, 2025
1c77cfe
added ignore above test
martijnvg Jan 27, 2025
a785110
Move OffsetDocValuesLoader to its own layer implementation
martijnvg Jan 27, 2025
fa03f46
iter
martijnvg Jan 27, 2025
ccbf0cd
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Jan 27, 2025
9664fa7
rename
martijnvg Feb 2, 2025
01cd313
move some offset logic from DocumentParserContext to new ArrayOffsetC…
martijnvg Feb 2, 2025
7ed0857
use correct full name for offset field
martijnvg Feb 2, 2025
012ac7f
Move offset parsing logic to DocumentParser
martijnvg Feb 3, 2025
4b4eaf4
[CI] Auto commit changes from spotless
Feb 3, 2025
64c6fe8
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 5, 2025
38f784a
synthetic recovery source is enabled by default now (in snapshot builds)
martijnvg Feb 5, 2025
b9535e1
Small iter
martijnvg Feb 5, 2025
470afad
handle empty array logic in DocumentParser
martijnvg Feb 5, 2025
80521c2
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 6, 2025
ab612ba
randomFrom
martijnvg Feb 6, 2025
f21cce6
reduce number of `setImmediateXContentParent(...)` invocations
martijnvg Feb 6, 2025
ca21c22
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 15, 2025
969139e
remove `parsesArrayValue(mapper) == false` checks
martijnvg Feb 15, 2025
acf0aed
cleanup DocumentParser and
martijnvg Feb 15, 2025
3d75e27
ensure doc values enabled and added more tests
martijnvg Feb 15, 2025
5487cf8
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 17, 2025
b89660a
iter
martijnvg Feb 17, 2025
ba0434b
s/:/./
martijnvg Feb 17, 2025
4bcde0d
iter
martijnvg Feb 17, 2025
5b6b05c
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 17, 2025
37634b9
Store offsets in sorted doc values field instead of binary doc values…
martijnvg Feb 18, 2025
09c6a0e
applied feedback
martijnvg Feb 18, 2025
60f45f2
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 19, 2025
bbee160
iter testSyntheticSourceKeepArrays() test
martijnvg Feb 19, 2025
4e6265f
add index version check
martijnvg Feb 19, 2025
405edf4
iter test
martijnvg Feb 19, 2025
7c7b3a3
[CI] Auto commit changes from spotless
Feb 19, 2025
cfe5b56
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 19, 2025
8049206
Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…
martijnvg Feb 20, 2025
3fcb461
cleanup supportStoringArrayOffsets(...) method
martijnvg Feb 20, 2025
5b1f80b
renamed test suites
martijnvg Feb 20, 2025
5 changes: 5 additions & 0 deletions docs/changelog/113757.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 113757
summary: Store array offsets for keyword fields natively with synthetic source instead of falling back to ignored source.
area: Mapping
type: enhancement
issues: []
4 changes: 4 additions & 0 deletions rest-api-spec/build.gradle
@@ -83,4 +83,8 @@ tasks.named("yamlRestCompatTestTransform").configure ({ task ->
"node_version warning is removed in 9.0"
)
task.skipTest("tsdb/20_mapping/nested fields", "nested field support in tsdb indices is now supported")
task.skipTest("logsdb/10_settings/routing path allowed in logs mode with routing on sort fields", "Unknown feature routing.logsb_route_on_sort_fields")
task.skipTest("indices.create/21_synthetic_source_stored/index param - field ordering", "Synthetic source keep arrays now stores leaf arrays natively")
Contributor
Why do we skip instead of changing the test?

Member Author
Because the yaml tests come from 8.x, they can't be changed here. The yamlRestCompatTestTransform task mutes or transforms tests before the yamlRestCompatTest task checks REST API backwards compatibility.

task.skipTest("indices.create/21_synthetic_source_stored/field param - keep nested array", "Synthetic source keep arrays now stores leaf arrays natively")
task.skipTest("indices.create/21_synthetic_source_stored/field param - keep root array", "Synthetic source keep arrays now stores leaf arrays natively")
})
@@ -922,7 +922,7 @@ subobjects auto:
- match: { hits.hits.0._source.foo: 10 }
- match: { hits.hits.0._source.foo\.bar: 100 }
- match: { hits.hits.0._source.regular.span.id: "1" }
- match: { hits.hits.0._source.regular.trace.id: [ "a", "b" ] }
- match: { hits.hits.0._source.regular.trace.id: ["a", "b" ] }
- match: { hits.hits.1._source.id: 2 }
- match: { hits.hits.1._source.foo: 20 }
- match: { hits.hits.1._source.foo\.bar: 200 }
@@ -1024,7 +1024,7 @@ index param - field ordering:
index: test

- length: { hits.hits.0._source: 4 }
- match: { hits.hits.0._source: { "a": "2", "b": [ { "bb": 100, "aa": 200 }, { "aa": 300, "bb": 400 } ], "c": [30, 20, 10], "d": [ { "bb": 10, "aa": 20 }, { "aa": 30, "bb": 40 } ] } }
- match: { hits.hits.0._source: { "a": "2", "b": [ { "bb": 100, "aa": 200 }, { "aa": 300, "bb": 400 } ], "c": ["30", "20", "10"], "d": [ { "bb": 10, "aa": 20 }, { "aa": 30, "bb": 40 } ] } }


---
@@ -154,6 +154,7 @@ private void internalParseDocument(MetadataFieldMapper[] metadataFieldsMappers,

executeIndexTimeScripts(context);

context.processArrayOffsets(context);
for (MetadataFieldMapper metadataMapper : metadataFieldsMappers) {
metadataMapper.postParse(context);
}
@@ -519,6 +520,7 @@ private static void throwOnCopyToOnObject(Mapper mapper, List<String> copyToFiel

private static void parseObject(final DocumentParserContext context, String currentFieldName) throws IOException {
assert currentFieldName != null;
context.setImmediateXContentParent(context.parser().currentToken());
Mapper objectMapper = context.getMapper(currentFieldName);
if (objectMapper != null) {
doParseObject(context, currentFieldName, objectMapper);
@@ -611,6 +613,12 @@ private static void throwOnCreateDynamicNestedViaCopyTo(Mapper dynamicObjectMapp
}

private static void parseArray(DocumentParserContext context, String lastFieldName) throws IOException {
// Record the previous immediate parent, so that it can be restored after the array has been parsed.
// This is used for recording array offsets with synthetic source. Offsets can only be accounted
// for accurately if the immediate parent is an array.
var prev = context.getImmediateXContentParent();
context.setImmediateXContentParent(context.parser().currentToken());

Mapper mapper = getLeafMapper(context, lastFieldName);
if (mapper != null) {
// There is a concrete mapper for this field already. Need to check if the mapper
@@ -624,6 +632,8 @@ private static void parseArray(DocumentParserContext context, String lastFieldNa
} else {
parseArrayDynamic(context, lastFieldName);
}
// Reset previous immediate parent
context.setImmediateXContentParent(prev);
}

private static void parseArrayDynamic(DocumentParserContext context, String currentFieldName) throws IOException {
@@ -688,11 +698,12 @@ private static void parseNonDynamicArray(
final String lastFieldName,
String arrayFieldName
) throws IOException {
boolean supportStoringArrayOffsets = mapper != null && mapper.supportStoringArrayOffsets(context);
String fullPath = context.path().pathAsText(arrayFieldName);

// Check if we need to record the array source. This only applies to synthetic source.
boolean canRemoveSingleLeafElement = false;
if (context.canAddIgnoredField()) {
if (context.canAddIgnoredField() && supportStoringArrayOffsets == false) {
Mapper.SourceKeepMode mode = Mapper.SourceKeepMode.NONE;
boolean objectWithFallbackSyntheticSource = false;
if (mapper instanceof ObjectMapper objectMapper) {
Expand Down Expand Up @@ -736,6 +747,7 @@ private static void parseNonDynamicArray(

XContentParser parser = context.parser();
XContentParser.Token token;
XContentParser.Token previousToken = parser.currentToken();
int elements = 0;
while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
if (token == XContentParser.Token.START_OBJECT) {
@@ -754,6 +766,14 @@ private static void parseNonDynamicArray(
elements++;
parseValue(context, lastFieldName);
}
previousToken = token;
}
if (mapper != null
&& context.canAddIgnoredField()
&& mapper.supportStoringArrayOffsets(context)
&& previousToken == XContentParser.Token.START_ARRAY
&& context.isImmediateParentAnArray()) {
context.getOffSetContext().maybeRecordEmptyArray(mapper.getOffsetFieldName());
}
if (elements <= 1 && canRemoveSingleLeafElement) {
context.removeLastIgnoredField(fullPath);
@@ -91,6 +91,31 @@ public LuceneDocument doc() {
protected void addDoc(LuceneDocument doc) {
in.addDoc(doc);
}

@Override
public void processArrayOffsets(DocumentParserContext context) throws IOException {
in.processArrayOffsets(context);
}

@Override
public FieldArrayContext getOffSetContext() {
return in.getOffSetContext();
}

@Override
public void setImmediateXContentParent(XContentParser.Token token) {
in.setImmediateXContentParent(token);
}

@Override
public XContentParser.Token getImmediateXContentParent() {
return in.getImmediateXContentParent();
}

@Override
public boolean isImmediateParentAnArray() {
return in.isImmediateParentAnArray();
}
}

/**
@@ -141,6 +166,8 @@ private enum Scope {
private final SeqNoFieldMapper.SequenceIDFields seqID;
private final Set<String> fieldsAppliedFromTemplates;

private FieldArrayContext fieldArrayContext;

/**
* Fields that are copied from values of other fields via copy_to.
* This per-document state is needed since it is possible
@@ -460,6 +487,33 @@ public boolean isCopyToDestinationField(String name) {
return copyToFields.contains(name);
}

public void processArrayOffsets(DocumentParserContext context) throws IOException {
if (fieldArrayContext != null) {
fieldArrayContext.addToLuceneDocument(context);
}
}

public FieldArrayContext getOffSetContext() {
if (fieldArrayContext == null) {
fieldArrayContext = new FieldArrayContext();
}
return fieldArrayContext;
}

private XContentParser.Token lastSetToken;

public void setImmediateXContentParent(XContentParser.Token token) {
this.lastSetToken = token;
}

public XContentParser.Token getImmediateXContentParent() {
return lastSetToken;
}

public boolean isImmediateParentAnArray() {
return lastSetToken == XContentParser.Token.START_ARRAY;
}

/**
* Add a new mapper dynamically created while parsing.
*
@@ -0,0 +1,93 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the "Elastic License
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side
* Public License v 1"; you may not use this file except in compliance with, at
* your election, the "Elastic License 2.0", the "GNU Affero General Public
* License v3.0 only", or the "Server Side Public License, v 1".
*/

package org.elasticsearch.index.mapper;

import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BitUtil;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldArrayContext {

private final Map<String, Offsets> offsetsPerField = new HashMap<>();

void recordOffset(String field, String value) {
Offsets arrayOffsets = offsetsPerField.computeIfAbsent(field, k -> new Offsets());
int nextOffset = arrayOffsets.currentOffset++;
var offsets = arrayOffsets.valueToOffsets.computeIfAbsent(value, s -> new ArrayList<>(2));
offsets.add(nextOffset);
}

void recordNull(String field) {
Offsets arrayOffsets = offsetsPerField.computeIfAbsent(field, k -> new Offsets());
int nextOffset = arrayOffsets.currentOffset++;
arrayOffsets.nullValueOffsets.add(nextOffset);
}

void maybeRecordEmptyArray(String field) {
@lkts (Contributor) Feb 6, 2025
Do we need this if recordOffset and recordNull do the same thing in the beginning?

Member Author

This is invoked when END_ARRAY is encountered. At that point there may already be an entry for the field; if not, an empty Offsets instance is added, which is later used to record an empty array in the related offsets field.

offsetsPerField.computeIfAbsent(field, k -> new Offsets());
}

void addToLuceneDocument(DocumentParserContext context) throws IOException {
for (var entry : offsetsPerField.entrySet()) {
var fieldName = entry.getKey();
var offset = entry.getValue();

int currentOrd = 0;
// This array allows retaining the original ordering of elements in leaf arrays, including duplicates.
int[] offsetToOrd = new int[offset.currentOffset];
for (var offsetEntry : offset.valueToOffsets.entrySet()) {
for (var offsetAndLevel : offsetEntry.getValue()) {
offsetToOrd[offsetAndLevel] = currentOrd;
}
currentOrd++;
}
for (var nullOffset : offset.nullValueOffsets) {
offsetToOrd[nullOffset] = -1;
}

try (var streamOutput = new BytesStreamOutput()) {
// A plain vint would do for the array length, but zigzag is used throughout so that my_field: null can be decoded as -1
streamOutput.writeVInt(BitUtil.zigZagEncode(offsetToOrd.length));
for (int ord : offsetToOrd) {
streamOutput.writeVInt(BitUtil.zigZagEncode(ord));
}
context.doc().add(new SortedDocValuesField(fieldName, streamOutput.bytes().toBytesRef()));
}
}
}

static int[] parseOffsetArray(StreamInput in) throws IOException {
int[] offsetToOrd = new int[BitUtil.zigZagDecode(in.readVInt())];
for (int i = 0; i < offsetToOrd.length; i++) {
offsetToOrd[i] = BitUtil.zigZagDecode(in.readVInt());
}
return offsetToOrd;
}

private static class Offsets {

int currentOffset;
// A TreeMap is needed here so that values are kept in their natural (sorted) order, which is the
// same order in which the values get stored in SortedSetDocValues. This ensures the ordinals
// assigned here line up with the ordinals in doc values.
final Map<String, List<Integer>> valueToOffsets = new TreeMap<>();
final List<Integer> nullValueOffsets = new ArrayList<>(2);

}

}
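The offset encoding in FieldArrayContext can be sketched in isolation. The following is a minimal, self-contained illustration (the class and helper names are hypothetical, not part of the PR): values in a leaf array are assigned ordinals in sorted order, mirroring SortedSetDocValues, nulls become -1, and each number is zigzag-encoded, equivalent to Lucene's BitUtil, so -1 fits in an unsigned vint. For the leaf array `["b", "a", "b", null]`, the sorted values are `["a", "b"]`, so the offset-to-ordinal array is `[1, 0, 1, -1]`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the offset-to-ordinal encoding; not the PR's actual classes.
public class OffsetEncodingSketch {

    // Same transform as Lucene's BitUtil.zigZagEncode(int): -1 -> 1, 0 -> 0, 1 -> 2, ...
    static int zigZagEncode(int i) {
        return (i >> 31) ^ (i << 1);
    }

    static int zigZagDecode(int i) {
        return (i >>> 1) ^ -(i & 1);
    }

    // Build the offset-to-ordinal array for one leaf array: values get ordinals in
    // sorted order (as in SortedSetDocValues), nulls get -1.
    static int[] offsetToOrd(List<String> leafArray) {
        Map<String, List<Integer>> valueToOffsets = new TreeMap<>();
        List<Integer> nullOffsets = new ArrayList<>();
        for (int offset = 0; offset < leafArray.size(); offset++) {
            String value = leafArray.get(offset);
            if (value == null) {
                nullOffsets.add(offset);
            } else {
                valueToOffsets.computeIfAbsent(value, k -> new ArrayList<>()).add(offset);
            }
        }
        int[] offsetToOrd = new int[leafArray.size()];
        int ord = 0;
        for (List<Integer> offsets : valueToOffsets.values()) {
            for (int offset : offsets) {
                offsetToOrd[offset] = ord;
            }
            ord++;
        }
        for (int offset : nullOffsets) {
            offsetToOrd[offset] = -1;
        }
        return offsetToOrd;
    }

    public static void main(String[] args) {
        int[] ords = offsetToOrd(Arrays.asList("b", "a", "b", null));
        System.out.println(Arrays.toString(ords)); // [1, 0, 1, -1]
        for (int o : ords) {
            // zigzag round-trips every ordinal, including -1
            if (zigZagDecode(zigZagEncode(o)) != o) throw new AssertionError();
        }
    }
}
```

Reading a document back then only requires decoding the vints and using each ordinal to index into the sorted doc values, which restores the original element order and duplicates.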
@@ -200,15 +200,15 @@ public void parse(DocumentParserContext context) throws IOException {
}
}

private void doParseMultiFields(DocumentParserContext context) throws IOException {
protected void doParseMultiFields(DocumentParserContext context) throws IOException {
context.path().add(leafName());
for (FieldMapper mapper : builderParams.multiFields.mappers) {
mapper.parse(context);
}
context.path().remove();
}

private static void throwIndexingWithScriptParam() {
protected static void throwIndexingWithScriptParam() {
throw new IllegalArgumentException("Cannot index data directly into a field with a [script] parameter");
}
