Skip to content

Conversation

@tomerqodo
Copy link

@tomerqodo tomerqodo commented Dec 4, 2025

User description

Benchmark PR elastic#137483

Type: Corrupted (contains bugs)

Original PR Title: Store keyword fields that trip ignore_above in binary doc values
Original PR Description: https://github.com/elastic/logs-program/issues/13
Original PR URL: elastic#137483


PR Type

Enhancement, Bug fix


Description

  • Store ignored keyword values in binary doc values instead of stored fields

  • Add BinaryDocValuesSyntheticFieldLoaderLayer for reading binary doc values during synthetic source generation

  • Create MultiValuedBinaryDocValuesField to encode multiple values as single binary array

  • Refactor MatchOnlyTextFieldMapper to use binary doc values for ignored values

  • Add CustomBinaryDocValues wrapper to decode binary doc values with quality-of-life functions

  • Simplify field fetcher logic by removing redundant stored field fallbacks

  • Update WildcardFieldMapper to use shared BinaryDocValuesSyntheticFieldLoaderLayer


Diagram Walkthrough

flowchart LR
  A["Ignored Keyword Values"] -->|"Encode as Binary"| B["MultiValuedBinaryDocValuesField"]
  B -->|"Store in Index"| C["Binary Doc Values"]
  C -->|"Read via Layer"| D["BinaryDocValuesSyntheticFieldLoaderLayer"]
  D -->|"Decode Values"| E["Synthetic Source"]
  F["MatchOnlyTextFieldMapper"] -->|"Use"| D
  G["WildcardFieldMapper"] -->|"Use"| D
Loading

File Walkthrough

Relevant files
Enhancement
MatchOnlyTextFieldMapper.java
Refactor to use binary doc values for ignored values         

modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

  • Add imports for binary doc values and stream utilities
  • Refactor parentFieldFetcher() to use
    ignoredValuesDocValuesFieldFetcher() instead of stored field fallback
  • Refactor delegateFieldFetcher() to simplify field name handling and
    use binary doc values fetcher
  • Extract getValuesFromDocValues() helper method to decode doc values
    consistently
  • Add ignoredValuesDocValuesFieldFetcher() to read ignored values from
    binary doc values field
  • Add CustomBinaryDocValues inner class to wrap BinaryDocValues and
    expose SortedBinaryDocValues interface
+82/-28 
BinaryDocValuesSyntheticFieldLoaderLayer.java
New layer for reading binary doc values in synthetic source

server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java

  • New file implementing CompositeSyntheticFieldLoader.DocValuesLayer
    interface
  • Reads binary doc values from a specified field name during synthetic
    source generation
  • Decodes binary array format: [doc value count][length of value
    1][value 1]...
  • Provides DocValuesLoader to iterate through documents and extract
    values
  • Implements write() method to output decoded values to XContentBuilder
+85/-0   
KeywordFieldMapper.java
Store ignored values in binary doc values field                   

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

  • Add imports for ElasticsearchException and BytesStreamOutput
  • Add import for LinkedHashSet to maintain insertion order
  • Modify indexValue() to store ignored values in
    MultiValuedBinaryDocValuesField instead of StoredField
  • Replace StoredFieldLayer with BinaryDocValuesSyntheticFieldLoaderLayer
    in synthetic source support
  • Add MultiValuedBinaryDocValuesField inner class that encodes multiple
    values as single binary array
  • Implement binaryValue() to serialize values with count and length
    prefixes
+59/-9   
NativeArrayIntegrationTestCase.java
Collapse single-element arrays to scalars                               

test/framework/src/main/java/org/elasticsearch/index/mapper/NativeArrayIntegrationTestCase.java

  • Modify arrayToSource() to collapse single-element arrays into scalar
    fields
  • Skip field synthesis if single element is null
  • Maintain array format for multi-element or null-only arrays
+10/-4   
WildcardFieldMapper.java
Use shared binary doc values layer implementation               

x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

  • Remove unused imports for BinaryDocValues, LeafReader, and
    ByteArrayStreamInput
  • Add import for BinaryDocValuesSyntheticFieldLoaderLayer
  • Replace WildcardSyntheticFieldLoader with
    BinaryDocValuesSyntheticFieldLoaderLayer
  • Remove WildcardSyntheticFieldLoader inner class implementation
+2/-54   
Tests
KeywordSyntheticSourceNativeArrayIntegrationTests.java
Update test for deduplicated ignored values                           

server/src/test/java/org/elasticsearch/index/mapper/KeywordSyntheticSourceNativeArrayIntegrationTests.java

  • Add blank line after mapping definition for readability
  • Update test to expect deduplicated values in synthetic source output
  • Add expectedArrayValues array showing deduplicated results
  • Change test call to pass both actual and expected arrays
+11/-1   
10_basic.yml
Update test data for deduplication behavior                           

modules/mapper-extras/src/yamlRestTest/resources/rest-api-spec/test/match_only_text/10_basic.yml

  • Add duplicate value "Apache Lucene" to test input array
  • Update expected synthetic source output to include the duplicate value
+2/-2     

@qodo-code-review
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Denial of service

Description: Unbounded accumulation of unique binary values in MultiValuedBinaryDocValuesField allows a
single document to store an arbitrary number/size of ignored keyword values, potentially
causing excessive memory usage or large doc values blobs leading to DoS during indexing or
synthetic source generation; there are no limits or guards on count, total byte size, or
duplicate handling beyond a LinkedHashSet.
KeywordFieldMapper.java [1349-1377]

Referred Code
public void add(final BytesRef value) {
    // might as well track these on the go as opposed to having to loop through all entries later
    docValuesByteCount += value.length;
    uniqueValues.add(value);
}

/**
 * Encodes the collection of binary doc values as a single contiguous binary array, wrapped in {@link BytesRef}. This array takes
 * the form of [doc value count][length of value 1][value 1][length of value 2][value 2]...
 */
@Override
public BytesRef binaryValue() {
    int docValuesCount = uniqueValues.size();
    // the + 1 is for the total doc values count, which is prefixed at the start of the array
    int streamSize = docValuesByteCount + docValuesCount * Integer.BYTES;

    try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) {
        out.writeVInt(docValuesCount);
        for (BytesRef value : uniqueValues) {
            int valueLength = value.length;


 ... (clipped 8 lines)
Untrusted data parsing

Description: The decoding loop reads valueCount items from a doc values buffer without validating
lengths beyond readVInt() which trusts on-disk data; if the stored array is corrupted or
crafted, this can trigger excessive allocations or long loops when writing synthetic
source (e.g., very large length values), risking OOM or prolonged CPU usage.
BinaryDocValuesSyntheticFieldLoaderLayer.java [61-68]

Referred Code
@Override
public void write(XContentBuilder b) throws IOException {
    for (int i = 0; i < valueCount; i++) {
        // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt()
        BytesRef valueBytes = stream.readBytesRef();
        b.value(valueBytes.utf8ToString());
    }
}
Untrusted data parsing

Description: CustomBinaryDocValues decodes a single binary blob per document based solely on a leading
count and subsequent lengths, without enforcing sane upper bounds for docValueCount or
per-value length, making it susceptible to OOM or excessive processing if doc values are
corrupted or maliciously large.
MatchOnlyTextFieldMapper.java [810-834]

Referred Code
@Override
public BytesRef nextValue() throws IOException {
    // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt()
    return stream.readBytesRef();
}

@Override
public boolean advanceExact(int docId) throws IOException {
    // if document has a value, read underlying bytes
    if (binaryDocValues.advanceExact(docId)) {
        BytesRef docValuesBytes = binaryDocValues.binaryValue();
        stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
        docValueCount = stream.readVInt();
        return true;
    }

    // otherwise there is nothing to do
    docValueCount = 0;
    return false;
}



 ... (clipped 4 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No audit logs: The new code adds storage and decoding of ignored keyword values using binary doc values
without emitting any audit logs for these critical data handling actions.

Referred Code
public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer {

    private final String fieldName;

    // the binary doc values for a document are all encoded in a single binary array, which this stream knows how to read
    // the doc values in the array take the form of [doc value count][length of value 1][value 1][length of value 2][value 2]...
    private final ByteArrayStreamInput stream;
    private int valueCount;

    public BinaryDocValuesSyntheticFieldLoaderLayer(String fieldName) {
        this.fieldName = fieldName;
        this.stream = new ByteArrayStreamInput();
    }

    @Override
    public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
        BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName);

        // there are no values associated with this field
        if (docValues == null) {
            valueCount = 0;


 ... (clipped 45 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Edge cases unhandled: Doc values decoding assumes well-formed binary format and does not validate lengths or
handle corrupted inputs beyond returning false, which may cause silent failures.

Referred Code
@Override
public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
    BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName);

    // there are no values associated with this field
    if (docValues == null) {
        valueCount = 0;
        return null;
    }

    return docId -> {
        // there are no more documents to process
        if (docValues.advanceExact(docId) == false) {
            valueCount = 0;
            return false;
        }

        // otherwise, extract the doc values into a stream to later read from
        BytesRef docValuesBytes = docValues.binaryValue();
        stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
        valueCount = stream.readVInt();


 ... (clipped 14 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Binary decode trust: The decoding of binary doc values trusts index data without bounds checks on lengths read
from the stream, which could risk OOM or corrupted reads if data is malformed.

Referred Code
        valueCount = stream.readVInt();
        stream.readVInt(); // skip first value length

        return hasValue();
    };
}

@Override
public void write(XContentBuilder b) throws IOException {
    for (int i = 0; i < valueCount; i++) {
        // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt()
        BytesRef valueBytes = stream.readBytesRef();
        b.value(valueBytes.utf8ToString());
    }
}

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
The custom binary format is flawed

The reader for the new custom binary format,
BinaryDocValuesSyntheticFieldLoaderLayer, contains a bug where it incorrectly
skips the length of the first value. This causes deserialization to fail and
must be fixed.

Examples:

server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java [55]
            stream.readVInt(); // skip first value length
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1366-1373]
            try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) {
                out.writeVInt(docValuesCount);
                for (BytesRef value : uniqueValues) {
                    int valueLength = value.length;
                    out.writeVInt(valueLength);
                    out.writeBytes(value.bytes, value.offset, valueLength);
                }
                return out.bytes().toBytesRef();

Solution Walkthrough:

Before:

// server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java

public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
    // ...
    return docId -> {
        // ...
        BytesRef docValuesBytes = docValues.binaryValue();
        stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
        valueCount = stream.readVInt();
        stream.readVInt(); // BUG: skip first value length

        return hasValue();
    };
}

After:

// server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java

public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException {
    // ...
    return docId -> {
        // ...
        BytesRef docValuesBytes = docValues.binaryValue();
        stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
        valueCount = stream.readVInt();
        // FIX: Do not skip the length of the first value.
        // The write() method's loop will correctly read each length-prefixed value.

        return hasValue();
    };
}
Suggestion importance[1-10]: 10

__

Why: This suggestion correctly identifies a critical bug in the new BinaryDocValuesSyntheticFieldLoaderLayer that causes data corruption during deserialization, rendering the synthetic source feature unusable for fields that rely on it.

High
Possible issue
Fix incorrect stream position handling

Remove the incorrect call to stream.readVInt() in docValuesLoader. This call
prematurely advances the stream pointer, which would cause data corruption when
reading values in the write method.

server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java [52-57]

 BytesRef docValuesBytes = docValues.binaryValue();
 stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length);
 valueCount = stream.readVInt();
-stream.readVInt(); // skip first value length
 
 return hasValue();
  • Apply / Chat
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a bug where stream.readVInt() is called unnecessarily, which would cause stream.readBytesRef() in the write method to fail by misinterpreting value bytes as a length prefix. This is a critical bug fix.

High
Fix potential buffer overflow issue

In binaryValue, instantiate BytesStreamOutput without a pre-allocated size to
prevent potential buffer overflows. The current size calculation for VInt is
incorrect and could lead to an ArrayIndexOutOfBoundsException.

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1361-1377]

 public BytesRef binaryValue() {
-    int docValuesCount = uniqueValues.size();
-    // the + 1 is for the total doc values count, which is prefixed at the start of the array
-    int streamSize = docValuesByteCount + docValuesCount * Integer.BYTES;
-
-    try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) {
-        out.writeVInt(docValuesCount);
+    try (BytesStreamOutput out = new BytesStreamOutput()) {
+        out.writeVInt(uniqueValues.size());
         for (BytesRef value : uniqueValues) {
-            int valueLength = value.length;
-            out.writeVInt(valueLength);
-            out.writeBytes(value.bytes, value.offset, valueLength);
+            out.writeVInt(value.length);
+            out.writeBytes(value.bytes, value.offset, value.length);
         }
         return out.bytes().toBytesRef();
     } catch (IOException e) {
         throw new ElasticsearchException("Failed to get binary value", e);
     }
 }
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that pre-allocating the BytesStreamOutput with a fixed size based on Integer.BYTES for VInt lengths is a potential bug. Using a dynamically-sized stream is more robust and prevents potential ArrayIndexOutOfBoundsException.

Medium
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants