Fix offsets not recording duplicate values #125354
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Good catch! One small comment - LGTM 👍
    boolean stored
);

public abstract long toSortableLong(Number value);
Now we have a good reason to have this method :).
Maybe add documentation on this method to explain why we need it?
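For reference, here is a minimal sketch of what such documentation and a `half_float` implementation could look like, assuming the type delegates to Lucene's `HalfFloatPoint` encoding; the class names `NumberType` and `HalfFloatType` are simplified stand-ins, not the exact classes touched by this PR:

```java
import org.apache.lucene.sandbox.document.HalfFloatPoint;

// Simplified stand-in for the abstract numeric type; not the exact class from the PR.
abstract class NumberType {
    /**
     * Converts a parsed value to the sortable long that is stored in doc values.
     * Applying this before the offset calculation ensures the offsets see the
     * same (possibly precision-losing) representation as the doc values loader.
     */
    public abstract long toSortableLong(Number value);
}

// Hypothetical half_float variant: the 16-bit encoding is where precision is lost,
// so two distinct floats can map to the same sortable long.
class HalfFloatType extends NumberType {
    @Override
    public long toSortableLong(Number value) {
        return HalfFloatPoint.halfFloatToSortableShort(value.floatValue());
    }
}
```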
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation.
Natively store synthetic source array offsets for numeric fields (#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)

  This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.

  (cherry picked from commit 376abfe)

  # Conflicts:
  #	server/src/main/java/org/elasticsearch/index/IndexVersions.java
  #	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

* Fix ignores malformed testcase (#125337)

  Fix and unmute testSynthesizeArrayRandomIgnoresMalformed

  (cherry picked from commit 2ff03ac)

  # Conflicts:
  #	muted-tests.yml

* Fix offsets not recording duplicate values (#125354)

  Previously, when calculating the offsets, we just compared the values as-is without any loss of precision. However, when the values were saved into doc values and loaded in the doc values loader, they could have lost precision. This meant that values that were not duplicates when calculating the offsets could now be duplicates in the doc values loader. This interfered with the de-duplication logic, causing incorrect values to be returned. My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates.

  (cherry picked from commit db73175)
While investigating a failing CI test for #125337, I discovered a bug in our current offset logic.
Basically, when calculating the offsets, we just compare the values as-is without any loss of precision. However, when the values are saved into doc values and loaded in the doc values loader, they will have lost precision. This means that values that were not duplicates when calculating the offsets will now be duplicates in the doc values loader. This interferes with the de-duplication logic, causing incorrect values to be returned.
Here's a concrete example.
This value is indexed into a `type: half_float` field: `[0.78151345, 0.6886488, 0.6882413]`. The corresponding offsets will be `[3, 2, 1]`. However, once the value is saved into the doc values and re-loaded, precision is lost and the `SortedNumericDocValues` become `[0.68847656, 0.68847656, 0.7817383]`. Note that the first two values are now duplicates. Because of the de-duplication logic in `NumericDocValuesWithOffsetsLoader#write`, the `values` array is then set to `[0.68847656, 0.7817383, 0]`. Finally, when the offsets are used to reconstruct the source using this values array, the resulting source is `[0, 0.7817383, 0.68847656]`, which does not match the original source.

My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the `SortedNumericDocValues` de-duplication see the same values as duplicates.
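To see the precision loss in isolation, here is a small self-contained sketch (the class name and the printing are mine, not part of the PR) that round-trips the example values through Lucene's half-float encoding, the same lossy step the doc values go through:

```java
import org.apache.lucene.sandbox.document.HalfFloatPoint;

public class HalfFloatDuplicateDemo {
    public static void main(String[] args) {
        float[] original = { 0.78151345f, 0.6886488f, 0.6882413f };
        for (float value : original) {
            // Encode to the 16-bit sortable representation used for doc values,
            // then decode it again to see the value the loader will observe.
            short sortable = HalfFloatPoint.halfFloatToSortableShort(value);
            float reloaded = HalfFloatPoint.sortableShortToHalfFloat(sortable);
            System.out.println(value + " -> " + reloaded);
        }
        // Expected output (matching the example above):
        // 0.78151345 -> 0.7817383
        // 0.6886488  -> 0.68847656
        // 0.6882413  -> 0.68847656
    }
}
```

The last two values collapse to the same half-float, which is why the `SortedNumericDocValues` de-duplication no longer lines up with offsets computed from the original, full-precision values.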