
Conversation

@jordan-powers
Contributor

While investigating a failing CI test for #125337, I discovered a bug in our current offset logic.

Basically, when calculating the offsets, we compare the values as-is, at full precision. However, when the values are saved into doc values and later loaded by the doc values loader, they will have lost precision. This means that values that were distinct when the offsets were calculated can be duplicates by the time the doc values loader sees them. This interferes with the de-duplication logic, causing incorrect values to be returned.

Here's a concrete example.
This array of values is indexed into a field of type `half_float`: [0.78151345, 0.6886488, 0.6882413].
The corresponding offsets will be [3, 2, 1].
However, once the values are saved into doc values and re-loaded, precision is lost and the SortedNumericDocValues become [0.68847656, 0.68847656, 0.7817383]. Note that the first two values are now duplicates.
Because of the de-duplication logic in NumericDocValuesWithOffsetsLoader#write, the values array is then set to [0.68847656, 0.7817383, 0].
Finally, when the offsets are used to reconstruct the source from this values array, the resulting source is [0, 0.7817383, 0.68847656], which does not match the original source.
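
To make that last step concrete, here's a toy sketch of the failing reconstruction (illustrative only, not the actual NumericDocValuesWithOffsetsLoader code; it treats the offsets as 1-based indices into the de-duplicated values array, matching the numbers above):

```java
import java.util.Arrays;

public class OffsetReconstructionDemo {
    public static void main(String[] args) {
        // Values array after de-duplication in the doc values loader: the two
        // half-float duplicates collapsed into one, leaving an unused trailing
        // slot that defaults to 0.
        double[] values = {0.68847656, 0.7817383, 0};
        // Offsets computed earlier against three distinct full-precision values.
        int[] offsets = {3, 2, 1};

        double[] reconstructed = new double[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            reconstructed[i] = values[offsets[i] - 1];
        }
        // Prints [0.0, 0.7817383, 0.68847656] -- not the original
        // [0.78151345, 0.6886488, 0.6882413] the offsets were computed from.
        System.out.println(Arrays.toString(reconstructed));
    }
}
```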

My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates.
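
For `half_float`, that precision loss is just a round-trip through the half-float encoding. A minimal sketch of the idea using Lucene's HalfFloatPoint (the `toStoredPrecision` helper name is mine, for illustration; it is not necessarily the method used in the actual fix):

```java
import org.apache.lucene.sandbox.document.HalfFloatPoint;

public class HalfFloatRoundTripDemo {
    // Illustrative helper: round-trip a value through the half-float encoding,
    // so the offset calculation sees exactly what the doc values loader will
    // later read back.
    static float toStoredPrecision(float value) {
        return HalfFloatPoint.sortableShortToHalfFloat(HalfFloatPoint.halfFloatToSortableShort(value));
    }

    public static void main(String[] args) {
        float[] source = {0.78151345f, 0.6886488f, 0.6882413f};
        for (float v : source) {
            System.out.println(v + " -> " + toStoredPrecision(v));
        }
        // 0.78151345 -> 0.7817383
        // 0.6886488  -> 0.68847656
        // 0.6882413  -> 0.68847656  (now a duplicate of the previous value)
    }
}
```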

@jordan-powers jordan-powers added >non-issue :StorageEngine/Mapping The storage related side of mappings v9.1.0 labels Mar 20, 2025
@jordan-powers jordan-powers requested a review from martijnvg March 20, 2025 22:07
@jordan-powers jordan-powers self-assigned this Mar 20, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@martijnvg martijnvg left a comment
Member

Good catch! One small comment - LGTM 👍

boolean stored
);

public abstract long toSortableLong(Number value);
@martijnvg martijnvg commented
Member

Now we have a good reason to have this method :).
Maybe add documentation on this method to explain why we need it?
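
A sketch of what that documentation might say (hypothetical wording drawn from the explanation in the PR description, not the Javadoc that actually landed):

```java
/**
 * Converts the given value to the sortable long representation stored in doc
 * values. Applying this conversion before calculating offsets ensures the
 * offset logic sees the same (possibly lossy) values that the doc values
 * loader will later read back, so both sides agree on which values are
 * duplicates.
 */
public abstract long toSortableLong(Number value);
```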

@jordan-powers jordan-powers enabled auto-merge (squash) March 21, 2025 16:40
@jordan-powers
Contributor Author

💚 All backports created successfully

Branch: 8.x

Questions? Please refer to the Backport tool documentation.

elasticsearchmachine pushed a commit that referenced this pull request Mar 21, 2025
Natively store synthetic source array offsets for numeric fields (#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)

This patch builds on the work in #122999 and #113757 to natively store
array offsets for numeric fields instead of falling back to ignored source
when `source_keep_mode: arrays`.

(cherry picked from commit 376abfe)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

* Fix ignores malformed testcase (#125337)

Fix and unmute testSynthesizeArrayRandomIgnoresMalformed

(cherry picked from commit 2ff03ac)

# Conflicts:
#	muted-tests.yml

* Fix offsets not recording duplicate values (#125354)

Previously, when calculating the offsets, we just compared the values as-is
without any loss of precision. However, when the values were saved into doc
values and loaded in the doc values loader, they could have lost precision.
This meant that values that were not duplicates when calculating the
offsets could now be duplicates in the doc values loader. This interfered
with the de-duplication logic, causing incorrect values to be returned.

My solution is to apply the precision loss before calculating the offsets,
so that both the offsets calculation and the SortedNumericDocValues
de-duplication see the same values as duplicates.

(cherry picked from commit db73175)
@jordan-powers jordan-powers deleted the fix-offsets-missing-duplicates branch March 25, 2025 21:48
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025