[CLEAN] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #27

tomerqodo · 2025-12-04T20:37:12Z

User description

Benchmark PR elastic#137483

Type: Clean (correct implementation)

Original PR Title: Store keyword fields that trip ignore_above in binary doc values
Original PR Description: https://github.com/elastic/logs-program/issues/13
Original PR URL: elastic#137483

PR Type

Enhancement

Description

Store ignored keyword values in binary doc values instead of stored fields
Add BinaryDocValuesSyntheticFieldLoaderLayer for reading binary doc values
Implement CustomBinaryDocValues wrapper for decoding binary doc values
Refactor synthetic source field fetchers to use binary doc values
Deduplicate ignored values using MultiValuedBinaryDocValuesField

Diagram Walkthrough

flowchart LR
  A["Ignored Keyword Values"] -->|Previously: StoredField| B["Stored Fields"]
  A -->|Now: Binary DocValues| C["BinaryDocValuesSyntheticFieldLoaderLayer"]
  C --> D["CustomBinaryDocValues Wrapper"]
  D --> E["Synthetic Source"]
  F["MultiValuedBinaryDocValuesField"] -->|Deduplicates| C

File Walkthrough

Relevant files

Enhancement

MatchOnlyTextFieldMapper.java `Refactor to use binary doc values for ignored values` modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java Add imports for `BinaryDocValues`, `DocValues`, `ByteArrayStreamInput`, and `SortedBinaryDocValues` Refactor `parentFieldFetcher()` to use `ignoredValuesDocValuesFieldFetcher()` instead of `storedFieldFetcher()` Refactor `delegateFieldFetcher()` to simplify field name handling and use binary doc values Extract `getValuesFromDocValues()` method to decode values from `SortedBinaryDocValues` Add `ignoredValuesDocValuesFieldFetcher()` method to read ignored values from binary doc values Add `CustomBinaryDocValues` inner class to wrap `BinaryDocValues` and expose `SortedBinaryDocValues` interface	+82/-28
BinaryDocValuesSyntheticFieldLoaderLayer.java `New layer for reading binary doc values` server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java New file implementing `CompositeSyntheticFieldLoader.DocValuesLayer` interface Reads binary doc values from a field and decodes them using `ByteArrayStreamInput` Handles doc value format: `[count][length1][value1][length2][value2]...` Provides methods to write decoded values to `XContentBuilder` and check value existence	+84/-0
KeywordFieldMapper.java `Store ignored values in binary doc values field` server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Add imports for `ElasticsearchException`, `BytesStreamOutput`, and `LinkedHashSet` Replace `StoredField` with `MultiValuedBinaryDocValuesField` for storing ignored values Replace `StoredFieldLayer` with `BinaryDocValuesSyntheticFieldLoaderLayer` in synthetic source support Add `MultiValuedBinaryDocValuesField` inner class to store and encode multiple binary values Implement deduplication of ignored values using `LinkedHashSet` Encode values as binary doc values with count prefix and length-prefixed entries	+59/-8
NativeArrayIntegrationTestCase.java `Collapse single-element arrays in source` test/framework/src/main/java/org/elasticsearch/index/mapper/NativeArrayIntegrationTestCase.java Modify `arrayToSource()` to collapse single-element arrays into scalar fields Skip synthesizing source for arrays containing only null values Maintain array format for multi-element arrays	+10/-4
WildcardFieldMapper.java `Use shared binary doc values layer` x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java Remove unused imports: `BinaryDocValues`, `LeafReader`, `ByteArrayStreamInput` Add import for `BinaryDocValuesSyntheticFieldLoaderLayer` Replace `WildcardSyntheticFieldLoader` with `BinaryDocValuesSyntheticFieldLoaderLayer` Remove `WildcardSyntheticFieldLoader` inner class implementation	+2/-54

Tests

KeywordSyntheticSourceNativeArrayIntegrationTests.java `Update test for deduplicated ignored values` server/src/test/java/org/elasticsearch/index/mapper/KeywordSyntheticSourceNativeArrayIntegrationTests.java Add blank line for formatting Update test to verify deduplication of ignored values in synthetic source Add `expectedArrayValues` to show deduplicated results Update test assertion to use new expected values	+11/-1
10_basic.yml `Update test data for deduplication` modules/mapper-extras/src/yamlRestTest/resources/rest-api-spec/test/match_only_text/10_basic.yml Add duplicate value `"Apache Lucene"` to test input array Update expected synthetic source output to include the duplicate value	+2/-2

qodo-code-review · 2025-12-04T20:37:46Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: No auditing: Newly added logic reads and writes binary doc values without emitting any audit logs for critical actions, making it unclear whether sensitive reads/writes are traceable. Referred Code @Override public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException { BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName); // there are no values associated with this field if (docValues == null) { valueCount = 0; return null; } return docId -> { // there are no more documents to process if (docValues.advanceExact(docId) == false) { valueCount = 0; return false; } // otherwise, extract the doc values into a stream to later read from BytesRef docValuesBytes = docValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); valueCount = stream.readVInt(); ... (clipped 13 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Error handling gaps: Newly introduced doc values reading and decoding paths lack explicit handling/logging for malformed data, IO failures, or unexpected counts, relying on upstream exceptions without contextual messages. Referred Code private IOFunction<LeafReaderContext, CheckedIntFunction<List<Object>, IOException>> ignoredValuesDocValuesFieldFetcher( String fieldName ) { return context -> { CustomBinaryDocValues ignoredValuesDocValues = new CustomBinaryDocValues(DocValues.getBinary(context.reader(), fieldName)); return docId -> getValuesFromDocValues(ignoredValuesDocValues, docId); }; } private List<Object> getValuesFromDocValues(SortedBinaryDocValues docValues, int docId) throws IOException { if (docValues.advanceExact(docId)) { var values = new ArrayList<>(docValues.docValueCount()); for (int i = 0; i < docValues.docValueCount(); i++) { values.add(docValues.nextValue().utf8ToString()); } return values; } else { return List.of(); } } Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Unvalidated decode: Decoding of length-prefixed binary doc values trusts encoded counts/lengths without explicit bounds checks in this layer, which could risk misuse if upstream contracts change. Referred Code @Override public void write(XContentBuilder b) throws IOException { for (int i = 0; i < valueCount; i++) { // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt() BytesRef valueBytes = stream.readBytesRef(); b.value(valueBytes.utf8ToString()); } Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-04T20:38:53Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Fix incorrect ignored value fetching Fix the `delegateFieldFetcher` method to correctly retrieve ignored values by using the `keywordDelegate`'s fallback field name and the `ignoredValuesDocValuesFieldFetcher`. modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java [332-359] private IOFunction<LeafReaderContext, CheckedIntFunction<List<Object>, IOException>> delegateFieldFetcher( final SearchExecutionContext searchExecutionContext, final KeywordFieldMapper.KeywordFieldType keywordDelegate ) { if (keywordDelegate.ignoreAbove().valuesPotentiallyIgnored()) { String delegateFieldName = keywordDelegate.name(); - // bc we don't know whether the delegate will ignore a value, we must also check the fallback field created by this - // match_only_text field - String fallbackName = syntheticSourceFallbackFieldName(); + String delegateFallbackName = keywordDelegate.syntheticSourceFallbackFieldName(); if (keywordDelegate.isStored()) { - return storedFieldFetcher(delegateFieldName, fallbackName); + return combineFieldFetchers(storedFieldFetcher(delegateFieldName), ignoredValuesDocValuesFieldFetcher(delegateFallbackName)); } else if (keywordDelegate.hasDocValues()) { var ifd = searchExecutionContext.getForField(keywordDelegate, MappedFieldType.FielddataOperation.SEARCH); - return combineFieldFetchers(docValuesFieldFetcher(ifd), storedFieldFetcher(fallbackName)); + return combineFieldFetchers(docValuesFieldFetcher(ifd), ignoredValuesDocValuesFieldFetcher(delegateFallbackName)); } } if (keywordDelegate.isStored()) { return storedFieldFetcher(keywordDelegate.name()); } else if (keywordDelegate.hasDocValues()) { var ifd = searchExecutionContext.getForField(keywordDelegate, MappedFieldType.FielddataOperation.SEARCH); return docValuesFieldFetcher(ifd); } else { assert false : "multi field should either be stored or have doc values"; return sourceFieldFetcher(searchExecutionContext); } } `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 10 __ Why: The suggestion identifies a critical logic bug in `delegateFieldFetcher` where it uses the wrong fallback field and an incorrect fetcher, which would cause ignored values not to be retrieved.	High
Possible issue	Deep-copy BytesRef to prevent mutation In the `add` method of `MultiValuedBinaryDocValuesField`, deep-copy the `BytesRef` value before adding it to the `uniqueValues` set to prevent issues arising from its mutability. server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1350-1355] public void add(final BytesRef value) { - if (uniqueValues.add(value)) { + // deep-copy the value before adding it to the set, as BytesRef is mutable + if (uniqueValues.add(BytesRef.deepCopyOf(value))) { // might as well track these on the go as opposed to having to loop through all entries later docValuesByteCount += value.length; } } Apply / Chat Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that storing a mutable `BytesRef` in a `Set` is a potential source of bugs and proposes a deep copy to ensure immutability, which is a critical defensive programming practice.	Medium
High-level	Consolidate custom binary doc value implementations The PR adds two separate classes, `CustomBinaryDocValues` and `MultiValuedBinaryDocValuesField`, to handle a custom binary format. This logic should be merged into a single reusable utility to prevent code duplication. Examples: server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1339-1380] private static final class MultiValuedBinaryDocValuesField extends CustomDocValuesField { private final Set<BytesRef> uniqueValues; private int docValuesByteCount = 0; MultiValuedBinaryDocValuesField(String name) { super(name); // linked hash set to maintain insertion order of elements uniqueValues = new LinkedHashSet<>(); } ... (clipped 32 lines) modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java [798-835] private static class CustomBinaryDocValues extends SortedBinaryDocValues { private final BinaryDocValues binaryDocValues; private final ByteArrayStreamInput stream; private int docValueCount = 0; CustomBinaryDocValues(BinaryDocValues binaryDocValues) { this.binaryDocValues = binaryDocValues; this.stream = new ByteArrayStreamInput(); ... (clipped 28 lines) Solution Walkthrough: Before: // In KeywordFieldMapper.java private static final class MultiValuedBinaryDocValuesField extends CustomDocValuesField { // ... logic to ENCODE multiple values into one BytesRef // Format: [count][len1][value1][len2][value2]... public BytesRef binaryValue() { // ... } } // In MatchOnlyTextFieldMapper.java private static class CustomBinaryDocValues extends SortedBinaryDocValues { // ... logic to DECODE one BytesRef into multiple values // Format: [count][len1][value1][len2][value2]... public boolean advanceExact(int docId) { // ... } public BytesRef nextValue() { // ... } } After: // In a new shared utility class, e.g., o.e.index.mapper.BinaryDocValuesUtils public final class BinaryDocValuesUtils { public static BytesRef encode(Collection<BytesRef> values) { // ... shared encoding logic } public static void decode(BytesRef encoded, Consumer<BytesRef> consumer) { // ... shared decoding logic } } // In KeywordFieldMapper.java // MultiValuedBinaryDocValuesField now uses the utility for encoding. // In MatchOnlyTextFieldMapper.java // CustomBinaryDocValues now uses the utility for decoding. Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies significant code duplication for a custom binary format's encoding/decoding logic across `KeywordFieldMapper` and `MatchOnlyTextFieldMapper`, which impacts maintainability.	Medium
General	Fix incorrect stream size calculation In the `binaryValue` method, remove the inaccurate pre-sizing of `BytesStreamOutput` and simplify the value writing loop by using `out.writeBytesRef(value)`. server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1362-1378] public BytesRef binaryValue() { - int docValuesCount = uniqueValues.size(); - // the + 1 is for the total doc values count, which is prefixed at the start of the array - int streamSize = docValuesByteCount + (docValuesCount + 1) * Integer.BYTES; - - try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) { - out.writeVInt(docValuesCount); + try (BytesStreamOutput out = new BytesStreamOutput()) { + out.writeVInt(uniqueValues.size()); for (BytesRef value : uniqueValues) { - int valueLength = value.length; - out.writeVInt(valueLength); - out.writeBytes(value.bytes, value.offset, valueLength); + out.writeBytesRef(value); } return out.bytes().toBytesRef(); } catch (IOException e) { throw new ElasticsearchException("Failed to get binary value", e); } } Apply / Chat Suggestion importance[1-10]: 6 __ Why: The suggestion correctly points out an inaccurate buffer pre-allocation, which is a performance concern, and proposes a simpler, more robust implementation that also improves code clarity.	Low
More

Apply changes for benchmark PR

2497292

qodo-code-review bot added the Review effort 3/5 label Dec 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CLEAN] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #27

[CLEAN] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #27

Uh oh!

tomerqodo commented Dec 4, 2025 •

edited by qodo-code-review bot

Loading

Uh oh!

qodo-code-review bot commented Dec 4, 2025

Uh oh!

qodo-code-review bot commented Dec 4, 2025

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[CLEAN] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #27

Are you sure you want to change the base?

[CLEAN] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #27

Uh oh!

Conversation

tomerqodo commented Dec 4, 2025 • edited by qodo-code-review bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Benchmark PR elastic#137483

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-code-review bot commented Dec 4, 2025

PR Compliance Guide 🔍

Uh oh!

qodo-code-review bot commented Dec 4, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomerqodo commented Dec 4, 2025 •

edited by qodo-code-review bot

Loading