Skip to content

Conversation

@tomerqodo
Copy link

@tomerqodo tomerqodo commented Dec 4, 2025

User description

Benchmark PR elastic#137483

Type: Clean (correct implementation)

Original PR Title: Store keyword fields that trip ignore_above in binary doc values
Original PR Description: https://github.com/elastic/logs-program/issues/13
Original PR URL: elastic#137483


PR Type

Enhancement


Description

  • Store ignored keyword values in binary doc values instead of stored fields

  • Add BinaryDocValuesSyntheticFieldLoaderLayer for reading binary doc values

  • Implement CustomBinaryDocValues wrapper for decoding binary doc values

  • Refactor synthetic source field fetchers to use binary doc values

  • Deduplicate ignored values using MultiValuedBinaryDocValuesField


Diagram Walkthrough

flowchart LR
  A["Ignored Keyword Values"] -->|Previously: StoredField| B["Stored Fields"]
  A -->|Now: Binary DocValues| C["BinaryDocValuesSyntheticFieldLoaderLayer"]
  C --> D["CustomBinaryDocValues Wrapper"]
  D --> E["Synthetic Source"]
  F["MultiValuedBinaryDocValuesField"] -->|Deduplicates| C
Loading

File Walkthrough

Relevant files
Enhancement
MatchOnlyTextFieldMapper.java
Refactor to use binary doc values for ignored values         

modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

  • Add imports for BinaryDocValues, DocValues, ByteArrayStreamInput, and
    SortedBinaryDocValues
  • Refactor parentFieldFetcher() to use
    ignoredValuesDocValuesFieldFetcher() instead of storedFieldFetcher()
  • Refactor delegateFieldFetcher() to simplify field name handling and
    use binary doc values
  • Extract getValuesFromDocValues() method to decode values from
    SortedBinaryDocValues
  • Add ignoredValuesDocValuesFieldFetcher() method to read ignored values
    from binary doc values
  • Add CustomBinaryDocValues inner class to wrap BinaryDocValues and
    expose SortedBinaryDocValues interface
+82/-28 
BinaryDocValuesSyntheticFieldLoaderLayer.java
New layer for reading binary doc values                                   

server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java

  • New file implementing CompositeSyntheticFieldLoader.DocValuesLayer
    interface
  • Reads binary doc values from a field and decodes them using
    ByteArrayStreamInput
  • Handles doc value format: [count][length1][value1][length2][value2]...
  • Provides methods to write decoded values to XContentBuilder and check
    value existence
+84/-0   
KeywordFieldMapper.java
Store ignored values in binary doc values field                   

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

  • Add imports for ElasticsearchException, BytesStreamOutput, and
    LinkedHashSet
  • Replace StoredField with MultiValuedBinaryDocValuesField for storing
    ignored values
  • Replace StoredFieldLayer with BinaryDocValuesSyntheticFieldLoaderLayer
    in synthetic source support
  • Add MultiValuedBinaryDocValuesField inner class to store and encode
    multiple binary values
  • Implement deduplication of ignored values using LinkedHashSet
  • Encode values as binary doc values with count prefix and
    length-prefixed entries
+59/-8   
NativeArrayIntegrationTestCase.java
Collapse single-element arrays in source                                 

test/framework/src/main/java/org/elasticsearch/index/mapper/NativeArrayIntegrationTestCase.java

  • Modify arrayToSource() to collapse single-element arrays into scalar
    fields
  • Skip synthesizing source for arrays containing only null values
  • Maintain array format for multi-element arrays
+10/-4   
WildcardFieldMapper.java
Use shared binary doc values layer                                             

x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

  • Remove unused imports: BinaryDocValues, LeafReader,
    ByteArrayStreamInput
  • Add import for BinaryDocValuesSyntheticFieldLoaderLayer
  • Replace WildcardSyntheticFieldLoader with
    BinaryDocValuesSyntheticFieldLoaderLayer
  • Remove WildcardSyntheticFieldLoader inner class implementation
+2/-54   
Tests
KeywordSyntheticSourceNativeArrayIntegrationTests.java
Update test for deduplicated ignored values                           

server/src/test/java/org/elasticsearch/index/mapper/KeywordSyntheticSourceNativeArrayIntegrationTests.java

  • Add blank line for formatting
  • Update test to verify deduplication of ignored values in synthetic
    source
  • Add expectedArrayValues to show deduplicated results
  • Update test assertion to use new expected values
+11/-1   
10_basic.yml
Update test data for deduplication                                             

modules/mapper-extras/src/yamlRestTest/resources/rest-api-spec/test/match_only_text/10_basic.yml

  • Add duplicate value "Apache Lucene" to test input array
  • Update expected synthetic source output to include the duplicate value
+2/-2     

@qodo-code-review
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No auditing: Newly added logic reads and writes binary doc values without emitting any audit logs for
critical actions, making it unclear whether sensitive reads/writes are traceable.

Referred Code
@Override
public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
    BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName);

    // there are no values associated with this field
    if (docValues == null) {
        valueCount = 0;
        return null;
    }

    return docId -> {
        // there are no more documents to process
        if (docValues.advanceExact(docId) == false) {
            valueCount = 0;
            return false;
        }

        // otherwise, extract the doc values into a stream to later read from
        BytesRef docValuesBytes = docValues.binaryValue();
        stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
        valueCount = stream.readVInt();


 ... (clipped 13 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling gaps: Newly introduced doc values reading and decoding paths lack explicit handling/logging for
malformed data, IO failures, or unexpected counts, relying on upstream exceptions without
contextual messages.

Referred Code
private IOFunction<LeafReaderContext, CheckedIntFunction<List<Object>, IOException>> ignoredValuesDocValuesFieldFetcher(
    String fieldName
) {
    return context -> {
        CustomBinaryDocValues ignoredValuesDocValues = new CustomBinaryDocValues(DocValues.getBinary(context.reader(), fieldName));
        return docId -> getValuesFromDocValues(ignoredValuesDocValues, docId);
    };
}

private List<Object> getValuesFromDocValues(SortedBinaryDocValues docValues, int docId) throws IOException {
    if (docValues.advanceExact(docId)) {
        var values = new ArrayList<>(docValues.docValueCount());
        for (int i = 0; i < docValues.docValueCount(); i++) {
            values.add(docValues.nextValue().utf8ToString());
        }
        return values;
    } else {
        return List.of();
    }
}

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated decode: Decoding of length-prefixed binary doc values trusts encoded counts/lengths without
explicit bounds checks in this layer, which could risk misuse if upstream contracts
change.

Referred Code
@Override
public void write(XContentBuilder b) throws IOException {
    for (int i = 0; i < valueCount; i++) {
        // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt()
        BytesRef valueBytes = stream.readBytesRef();
        b.value(valueBytes.utf8ToString());
    }

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix incorrect ignored value fetching

Fix the delegateFieldFetcher method to correctly retrieve ignored values by
using the keywordDelegate's fallback field name and the
ignoredValuesDocValuesFieldFetcher.

modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java [332-359]

 private IOFunction<LeafReaderContext, CheckedIntFunction<List<Object>, IOException>> delegateFieldFetcher(
     final SearchExecutionContext searchExecutionContext,
     final KeywordFieldMapper.KeywordFieldType keywordDelegate
 ) {
     if (keywordDelegate.ignoreAbove().valuesPotentiallyIgnored()) {
         String delegateFieldName = keywordDelegate.name();
-        // bc we don't know whether the delegate will ignore a value, we must also check the fallback field created by this
-        // match_only_text field
-        String fallbackName = syntheticSourceFallbackFieldName();
+        String delegateFallbackName = keywordDelegate.syntheticSourceFallbackFieldName();
 
         if (keywordDelegate.isStored()) {
-            return storedFieldFetcher(delegateFieldName, fallbackName);
+            return combineFieldFetchers(storedFieldFetcher(delegateFieldName), ignoredValuesDocValuesFieldFetcher(delegateFallbackName));
         } else if (keywordDelegate.hasDocValues()) {
             var ifd = searchExecutionContext.getForField(keywordDelegate, MappedFieldType.FielddataOperation.SEARCH);
-            return combineFieldFetchers(docValuesFieldFetcher(ifd), storedFieldFetcher(fallbackName));
+            return combineFieldFetchers(docValuesFieldFetcher(ifd), ignoredValuesDocValuesFieldFetcher(delegateFallbackName));
         }
     }
 
     if (keywordDelegate.isStored()) {
         return storedFieldFetcher(keywordDelegate.name());
     } else if (keywordDelegate.hasDocValues()) {
         var ifd = searchExecutionContext.getForField(keywordDelegate, MappedFieldType.FielddataOperation.SEARCH);
         return docValuesFieldFetcher(ifd);
     } else {
         assert false : "multi field should either be stored or have doc values";
         return sourceFieldFetcher(searchExecutionContext);
     }
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 10

__

Why: The suggestion identifies a critical logic bug in delegateFieldFetcher where it uses the wrong fallback field and an incorrect fetcher, which would cause ignored values not to be retrieved.

High
Deep-copy BytesRef to prevent mutation

In the add method of MultiValuedBinaryDocValuesField, deep-copy the BytesRef
value before adding it to the uniqueValues set to prevent issues arising from
its mutability.

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1350-1355]

 public void add(final BytesRef value) {
-    if (uniqueValues.add(value)) {
+    // deep-copy the value before adding it to the set, as BytesRef is mutable
+    if (uniqueValues.add(BytesRef.deepCopyOf(value))) {
         // might as well track these on the go as opposed to having to loop through all entries later
         docValuesByteCount += value.length;
     }
 }
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that storing a mutable BytesRef in a Set is a potential source of bugs and proposes a deep copy to ensure immutability, which is a critical defensive programming practice.

Medium
High-level
Consolidate custom binary doc value implementations

The PR adds two separate classes, CustomBinaryDocValues and
MultiValuedBinaryDocValuesField, to handle a custom binary format. This logic
should be merged into a single reusable utility to prevent code duplication.

Examples:

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1339-1380]
    private static final class MultiValuedBinaryDocValuesField extends CustomDocValuesField {

        private final Set<BytesRef> uniqueValues;
        private int docValuesByteCount = 0;

        MultiValuedBinaryDocValuesField(String name) {
            super(name);
            // linked hash set to maintain insertion order of elements
            uniqueValues = new LinkedHashSet<>();
        }

 ... (clipped 32 lines)
modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java [798-835]
    private static class CustomBinaryDocValues extends SortedBinaryDocValues {

        private final BinaryDocValues binaryDocValues;
        private final ByteArrayStreamInput stream;

        private int docValueCount = 0;

        CustomBinaryDocValues(BinaryDocValues binaryDocValues) {
            this.binaryDocValues = binaryDocValues;
            this.stream = new ByteArrayStreamInput();

 ... (clipped 28 lines)

Solution Walkthrough:

Before:

// In KeywordFieldMapper.java
private static final class MultiValuedBinaryDocValuesField extends CustomDocValuesField {
    // ... logic to ENCODE multiple values into one BytesRef
    // Format: [count][len1][value1][len2][value2]...
    public BytesRef binaryValue() {
        // ...
    }
}

// In MatchOnlyTextFieldMapper.java
private static class CustomBinaryDocValues extends SortedBinaryDocValues {
    // ... logic to DECODE one BytesRef into multiple values
    // Format: [count][len1][value1][len2][value2]...
    public boolean advanceExact(int docId) {
        // ...
    }
    public BytesRef nextValue() {
        // ...
    }
}

After:

// In a new shared utility class, e.g., o.e.index.mapper.BinaryDocValuesUtils
public final class BinaryDocValuesUtils {
    public static BytesRef encode(Collection<BytesRef> values) {
        // ... shared encoding logic
    }

    public static void decode(BytesRef encoded, Consumer<BytesRef> consumer) {
        // ... shared decoding logic
    }
}

// In KeywordFieldMapper.java
// MultiValuedBinaryDocValuesField now uses the utility for encoding.

// In MatchOnlyTextFieldMapper.java
// CustomBinaryDocValues now uses the utility for decoding.
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies significant code duplication for a custom binary format's encoding/decoding logic across KeywordFieldMapper and MatchOnlyTextFieldMapper, which impacts maintainability.

Medium
General
Fix incorrect stream size calculation

In the binaryValue method, remove the inaccurate pre-sizing of BytesStreamOutput
and simplify the value writing loop by using out.writeBytesRef(value).

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1362-1378]

 public BytesRef binaryValue() {
-    int docValuesCount = uniqueValues.size();
-    // the + 1 is for the total doc values count, which is prefixed at the start of the array
-    int streamSize = docValuesByteCount + (docValuesCount + 1) * Integer.BYTES;
-
-    try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) {
-        out.writeVInt(docValuesCount);
+    try (BytesStreamOutput out = new BytesStreamOutput()) {
+        out.writeVInt(uniqueValues.size());
         for (BytesRef value : uniqueValues) {
-            int valueLength = value.length;
-            out.writeVInt(valueLength);
-            out.writeBytes(value.bytes, value.offset, valueLength);
+            out.writeBytesRef(value);
         }
         return out.bytes().toBytesRef();
     } catch (IOException e) {
         throw new ElasticsearchException("Failed to get binary value", e);
     }
 }
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly points out an inaccurate buffer pre-allocation, which is a performance concern, and proposes a simpler, more robust implementation that also improves code clarity.

Low
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants