
Conversation

@jordan-powers
Contributor

This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.

Nested contexts don't work right with the current offset context logic.
For now, we can disable native synthetic source support and fall back to
the ignored source mechanism.

We can revisit supporting native synthetic source within nested contexts
at a later date.
This patch removes the context.isImmediateParentAnArray() check when
deciding whether to store an array offset. This is necessary in case we
are parsing fields within an object array.

For example, for the document `{"path":[{"int_value":10},{"int_value":20}]}`
with `synthetic_source_keep: arrays` on the `int_value` field, we'd want
to store the offsets for `int_value` even though the immediate parent is
an object and not an array.
Single-element arrays can be unwrapped into arrayless field values, so we
need to handle that case.
@jordan-powers jordan-powers requested a review from martijnvg March 11, 2025 20:22
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Mar 11, 2025
@jordan-powers jordan-powers added >enhancement auto-backport Automatically create backport pull requests when merged test-full-bwc Trigger full BWC version matrix tests :StorageEngine/Mapping The storage related side of mappings v8.19.0 labels Mar 11, 2025
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Mar 11, 2025
@elasticsearchmachine
Collaborator

Hi @jordan-powers, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

} else {
value = fieldType().nullValue;
}
if (offsetsFieldName != null && context.canAddIgnoredField()) {
Contributor Author

I don't check context.isImmediateParentAnArray() because we also need to record the offsets for object arrays.

There was a failing test IgnoredSourceFieldMapperTests#testIndexStoredArraySourceSingleLeafElementInObjectArray

It had the following:
mapping: {"path": {"type": "object", "properties": { "int_value": { "type": "long", "synthetic_source_keep": "arrays" }}}}
document: {"path":[{"int_value":10},{"int_value":20}]}

The values for int_value were not immediate children of arrays, and so the DocumentParser would fall back to the ignored source mechanism. Then, when the synthetic _source was generated, the resultant document would look like this:
{"path":{"int_value":[10, 20]}}

Then, when this document was round-tripped and indexed in a new synthetic source index, the values would be immediate children of arrays, so the offsets would be recorded.

This caused a mismatch in the index writers (since one had the offsets field and the other did not), and so the test would fail.

My solution is to just always record the offsets, even when the immediate parent is not an array.

This has a couple of drawbacks:

  • This does create an inefficiency where we're potentially storing the offsets even for single-value arrays. We could probably fix that by adding a check to FieldArrayContext#addToLuceneDocument to skip recording the offsets if there's only one non-null value.
  • In the source loader, we can no longer rely on the presence of the offset field to differentiate between a single value (e.g. field: 5) and a single-value array (e.g. field: [5]). As such, my current implementation of the loader always unwraps single-value arrays into a single value. I think this is fine because it matches the ignored source implementation.
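To make the mechanism under discussion concrete, here is a minimal standalone sketch (hypothetical names, not the actual Elasticsearch classes): doc values effectively keep the sorted, de-duplicated values, and the offsets field records each array slot's ordinal into that sorted list, so a loader can rebuild the array in original source order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Standalone sketch of offset-based array storage (hypothetical names, not
// the real Elasticsearch classes). Doc values retain sorted, unique values;
// the offsets field stores each slot's ordinal into that sorted list.
public class OffsetSketch {

    // What sorted numeric doc values effectively retain: sorted, de-duplicated values.
    static List<Long> sortedUnique(long[] original) {
        TreeSet<Long> set = new TreeSet<>();
        for (long v : original) {
            set.add(v);
        }
        return new ArrayList<>(set);
    }

    // Per-slot ordinals recorded at parse time, preserving source order.
    static int[] offsets(long[] original) {
        List<Long> sorted = sortedUnique(original);
        int[] offs = new int[original.length];
        for (int i = 0; i < original.length; i++) {
            offs[i] = sorted.indexOf(original[i]);
        }
        return offs;
    }

    // What a synthetic source loader would do: rebuild the array in source order.
    static long[] rebuild(List<Long> sorted, int[] offs) {
        long[] out = new long[offs.length];
        for (int i = 0; i < offs.length; i++) {
            out[i] = sorted.get(offs[i]);
        }
        return out;
    }
}
```

Note that in this sketch a single-element array like [5] produces exactly one offset, the same as the plain value 5 would, which is why the loader cannot distinguish the two and unwraps.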

Member

I think we need to check for context.isImmediateParentAnArray() here. The current offset encoding doesn't work well if the immediate parent is not an array. If there is an object array higher up in the tree, then the array is pushed down to the leaf when synthesizing. This is incorrect, as this is not how the document was provided during indexing. This is why we fall back to ignored source in this case.

Member

I suspect that the reason the test fails (with the context.isImmediateParentAnArray() check) is that initially the document gets indexed using an array, but SortedNumericWithOffsetsDocValuesSyntheticFieldLoader normalizes that array to a single value; the round-trip indexing then doesn't have an array, and therefore there is no offsets field (because context.isImmediateParentAnArray() returns false).

Contributor Author

The problem is actually the other way around.

Here's the error:
java.lang.AssertionError: round trip {"path":{"int_value":[10,20]}} expected:<[_seq_no, _version, _primary_term, path.int_value]> but was:<[_seq_no, path.int_value.offsets, _version, _primary_term, path.int_value]>

The first time the document is indexed, the document has the structure {"path":[{"int_value":10},{"int_value":20}]}. Since 10 and 20 are not immediate children of arrays, instead the fallback ignored source mechanism kicks in (via the check in DocumentParser#parseObjectOrField()).

However, one of the modifications made by synthetic source is that arrays are moved to leaf fields. So when the synthetic source is returned, instead the document has the structure {"path":{"int_value":[10, 20]}}.
When this modified document is indexed, the values 10 and 20 are now immediate children of an array, and so the offset encoding happens.

This means that the offsets field does not exist the first time the document is indexed, but it exists the second time after the roundtrip.
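The push-down described above can be modeled with plain maps and lists (a sketch of the transformation only, not the actual synthesizing code): leaf values scattered across an object array get collected into a single leaf array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the array push-down that synthetic source applies: values of a
// leaf field spread across an object array are gathered into one leaf array.
// (Plain maps stand in for parsed JSON; not the actual synthesizing code.)
public class PushDownSketch {

    static Map<String, Object> pushDown(List<Map<String, Object>> objectArray, String leaf) {
        List<Object> values = new ArrayList<>();
        for (Map<String, Object> obj : objectArray) {
            values.add(obj.get(leaf));
        }
        return Map.of(leaf, values);
    }
}
```

Applied to the object array from "path", this yields a structure whose leaf values are immediate children of an array, which is exactly why the round-tripped document takes the offset-encoding path while the original did not.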

Contributor

If arrays are moved to leaves anyway, why can't we encode them properly (unless the parent has synthetic_source_keep configured)?

You'll still have this problem with the *.offsets field; you will need to exclude it from the check. See the validateRoundTripReader method.

Contributor Author

Thanks @lkts, I was able to mask the "*.offsets" field in assertReaderEquals and the tests pass now.

While there still is a difference in how the values are stored in lucene when initially indexed vs. when re-indexed, I think it's ok because the resultant _source is the same.

I'm not sure if we reached a conclusion on whether to support encoding object arrays using offsets, but if we decide to do so, it can happen in a follow-up PR. For now, we still encode object arrays using ignored source and only use offset encoding for leaf arrays.

} else {
assertThat(actualArray, Matchers.contains(expected));
}
} else {
Contributor Author

Had to update this check since single-value arrays are unwrapped to single values.

Member

So single-slot arrays should be retained and not be normalized to a single value. This works for ip and keyword fields as well, so this should work for number field types too?

Contributor Author

This was also part of the change removing the check for context.isImmediateParentAnArray(), since the change caused single-slot arrays to be returned as single values.

Since we're rethinking that solution, I'll probably end up reverting this too.

Member

@martijnvg martijnvg left a comment

Thanks @jordan-powers, I did my first review round. This looks like it's heading in the right direction.

var arrayValues = new Object[randomInt(64)];
for (int j = 0; j < arrayValues.length; j++) {
arrayValues[j] = NetworkAddress.format(randomIp(true));
arrayValues[j] = getRandomValue();
Member

oops :)

if (offsetsFieldName != null && context.canAddIgnoredField()) {
if (value != null) {
final long sortableLongValue = type.sortableLongValue(value);
context.getOffSetContext().recordOffset(offsetsFieldName, sortableLongValue);
Member

I think that casting the value variable to Comparable works too?

Suggested change
context.getOffSetContext().recordOffset(offsetsFieldName, sortableLongValue);
context.getOffSetContext().recordOffset(offsetsFieldName, (Comparable<?>) value);

This way we don't need the type.sortableLongValue(...) method?

Contributor Author

I'll give it a shot

Contributor

You can also make sortableLongValue return Number with that.

|| (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope())
|| (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS
&& context.inArrayScope()
&& fieldMapper.supportStoringArrayOffsets() == false)
Member

Can you explain why this change is needed?

Contributor Author

This was part of the change removing the check for context.isImmediateParentAnArray(). Adding this to the check disabled the fallback source for that object array so that we could use the offset encoding.

Since we're rethinking that solution, I'll probably end up reverting this.

&& sourceKeepMode == Mapper.SourceKeepMode.ARRAYS
&& hasDocValues
&& isStored == false
&& context.isInNestedContext() == false
Member

This makes sense. Can you add a test to NativeArrayIntegrationTestCase, which checks that if leaf array field has nested parent field, then we always fall back to ignored source?


}

public void testOffsetArrayRandom() throws Exception {
StringBuilder values = new StringBuilder();
Member

Is this change required to make tests pass?

Contributor Author

The original implementation converted everything to quoted strings (e.g. ["4", "-1", "20"]). However, Elasticsearch would convert them to numbers when returning the synthetic source (e.g. [4, -1, 20]), causing tests to fail. I needed some way to convert the random values into quoted or unquoted strings depending on the type. I figured that since XContentBuilder already has that logic, I'd just pass it down.

Alternatively I could change the return type of getRandomValue() to Object, then use the array(String name, Object... values) method on XContentBuilder.
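The type-dependent quoting being discussed boils down to a one-line dispatch (a hypothetical illustration, not the test's actual helper): numbers are emitted unquoted, everything else quoted.

```java
// Hypothetical illustration of the quoting problem: numbers must be emitted
// unquoted and strings quoted, which is the kind of type dispatch that
// XContentBuilder already centralizes. Not the test's actual helper.
public class RenderSketch {

    static String render(Object value) {
        return (value instanceof Number) ? value.toString() : "\"" + value + "\"";
    }
}
```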


import java.io.IOException;

class SortedNumericWithOffsetsDocValuesSyntheticFieldLoader extends SourceLoader.DocValuesBasedSyntheticFieldLoader {
Member

I think we should instead implement this as a CompositeSyntheticFieldLoader.DocValuesLayer (like SortedSetWithOffsetsDocValuesSyntheticFieldLoaderLayer).

return;
}

int count = count();
Member

If this class implements CompositeSyntheticFieldLoader.DocValuesLayer, then we don't need the logic here that determines whether something needs to be serialized as an array, and we can serialize regular number values as regular number values (instead of wrapping them in an array).

Member

@martijnvg martijnvg left a comment

I left one comment about testing the number field type via NativeArrayIntegrationTestCase, but other than that this change LGTM.

if (i != (numValues - 1)) {
values.append(',');

var previousValues = new HashSet<Object>();
Member

nit:

Suggested change
var previousValues = new HashSet<Object>();
var previousValues = new HashSet<>();


public void testSyntheticSourceIndexLevelKeepArrays() throws IOException {
SyntheticSourceExample example = syntheticSourceSupportForKeepTests(shouldUseIgnoreMalformed()).example(1);
SyntheticSourceExample example = syntheticSourceSupportForKeepTests(shouldUseIgnoreMalformed(), Mapper.SourceKeepMode.ARRAYS)
Member

Cool, makes sense.


@Override
protected String getFieldTypeName() {
return "long";
Member

I think subclasses for other number types are still missing?

@jordan-powers jordan-powers merged commit 376abfe into elastic:main Mar 20, 2025
17 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 124594

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…stic#124594)

This patch builds on the work in elastic#122999 and elastic#113757 to natively store
array offsets for numeric fields instead of falling back to ignored source
when `source_keep_mode: arrays`.
@jordan-powers
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Mar 21, 2025
#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)

This patch builds on the work in #122999 and #113757 to natively store
array offsets for numeric fields instead of falling back to ignored source
when `source_keep_mode: arrays`.

(cherry picked from commit 376abfe)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

* Fix ignores malformed testcase (#125337)

Fix and unmute testSynthesizeArrayRandomIgnoresMalformed

(cherry picked from commit 2ff03ac)

# Conflicts:
#	muted-tests.yml

* Fix offsets not recording duplicate values (#125354)

Previously, when calculating the offsets, we just compared the values as-is
without any loss of precision. However, when the values were saved into doc
values and loaded in the doc values loader, they could have lost precision.
This meant that values that were not duplicates when calculating the
offsets could now be duplicates in the doc values loader. This interfered
with the de-duplication logic, causing incorrect values to be returned.

My solution is to apply the precision loss before calculating the offsets,
so that both the offsets calculation and the SortedNumericDocValues
de-duplication see the same values as duplicates.

(cherry picked from commit db73175)
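The fix described in #125354 can be sketched standalone (hypothetical names; precision loss modeled as a double-to-float round trip, similar to what a float field does to double input): computing the offsets against the lossy values keeps their ordinals aligned with what the doc values actually de-duplicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the de-duplication fix (hypothetical names, not the real code):
// values are normalized through the same precision loss the doc values apply
// *before* computing offsets, so both sides agree on which values collapse.
public class PrecisionSketch {

    // Stand-in for doc-values precision loss (e.g. a float field storing doubles).
    static double lossy(double v) {
        return (double) (float) v;
    }

    // Ordinals computed against the lossy values, mirroring how the sorted
    // numeric doc values de-duplicate after precision loss.
    static int[] offsets(double[] original) {
        TreeSet<Double> unique = new TreeSet<>();
        for (double v : original) {
            unique.add(lossy(v));
        }
        List<Double> sorted = new ArrayList<>(unique);
        int[] offs = new int[original.length];
        for (int i = 0; i < original.length; i++) {
            offs[i] = sorted.indexOf(lossy(original[i]));
        }
        return offs;
    }
}
```

For instance, 1.00000001 and 1.00000002 are distinct doubles but collapse to the same float; computing offsets on the raw doubles would yield two ordinals while the doc values keep only one value, which is the mismatch the fix removes.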
jordan-powers added a commit that referenced this pull request Mar 25, 2025
#125529)

This patch builds on the work in #113757, #122999, and #124594 to natively
store array offsets for boolean fields instead of falling back to ignored
source when `synthetic_source_keep: arrays`.
elasticsearchmachine pushed a commit that referenced this pull request Mar 25, 2025
#125529) (#125596)

This patch builds on the work in #113757, #122999, and #124594 to natively
store array offsets for boolean fields instead of falling back to ignored
source when `synthetic_source_keep: arrays`.

(cherry picked from commit af1f145)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	server/src/main/java/org/elasticsearch/index/mapper/BooleanFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 26, 2025
… source (#125709)

This patch builds on the work in #113757, #122999, #124594, and #125529 to
natively store array offsets for unsigned long fields instead of falling
back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 27, 2025
… source (#125709) (#125746)

This patch builds on the work in #113757, #122999, #124594, and #125529 to
natively store array offsets for unsigned long fields instead of falling
back to ignored source when synthetic_source_keep: arrays.

(cherry picked from commit 689eaf2)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	x-pack/plugin/mapper-unsigned-long/src/main/java/org/elasticsearch/xpack/unsignedlong/UnsignedLongFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 28, 2025
…source (#125793)

This patch builds on the work in #113757, #122999, #124594, #125529, and 
#125709 to natively store array offsets for scaled float fields instead of
falling back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…stic#124594)

This patch builds on the work in elastic#122999 and elastic#113757 to natively store
array offsets for numeric fields instead of falling back to ignored source
when `source_keep_mode: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
elastic#125529)

This patch builds on the work in elastic#113757, elastic#122999, and elastic#124594 to natively
store array offsets for boolean fields instead of falling back to ignored
source when `synthetic_source_keep: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
… source (elastic#125709)

This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, and elastic#125529 to
natively store array offsets for unsigned long fields instead of falling
back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…source (elastic#125793)

This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, elastic#125529, and 
elastic#125709 to natively store array offsets for scaled float fields instead of
falling back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 28, 2025
…source (#125793) (#125891)

This patch builds on the work in #113757, #122999, #124594, #125529, and
#125709 to natively store array offsets for scaled float fields instead of
falling back to ignored source when synthetic_source_keep: arrays.

(cherry picked from commit 71e74bd)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
@jordan-powers jordan-powers deleted the array-offset-encoding-numbers branch April 1, 2025 07:06