[CORRUPTED] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #28

tomerqodo · 2025-12-04T20:37:14Z

User description

Benchmark PR elastic#137483

Type: Corrupted (contains bugs)

Original PR Title: Store keyword fields that trip ignore_above in binary doc values
Original PR Description: https://github.com/elastic/logs-program/issues/13
Original PR URL: elastic#137483

PR Type

Enhancement, Bug fix

Description

Store ignored keyword values in binary doc values instead of stored fields
Add BinaryDocValuesSyntheticFieldLoaderLayer for reading binary doc values during synthetic source generation
Create MultiValuedBinaryDocValuesField to encode multiple values as single binary array
Refactor MatchOnlyTextFieldMapper to use binary doc values for ignored values
Add CustomBinaryDocValues wrapper to decode binary doc values with quality-of-life functions
Simplify field fetcher logic by removing redundant stored field fallbacks
Update WildcardFieldMapper to use shared BinaryDocValuesSyntheticFieldLoaderLayer

Diagram Walkthrough

flowchart LR
  A["Ignored Keyword Values"] -->|"Encode as Binary"| B["MultiValuedBinaryDocValuesField"]
  B -->|"Store in Index"| C["Binary Doc Values"]
  C -->|"Read via Layer"| D["BinaryDocValuesSyntheticFieldLoaderLayer"]
  D -->|"Decode Values"| E["Synthetic Source"]
  F["MatchOnlyTextFieldMapper"] -->|"Use"| D
  G["WildcardFieldMapper"] -->|"Use"| D

File Walkthrough

Relevant files

Enhancement

MatchOnlyTextFieldMapper.java `Refactor to use binary doc values for ignored values` modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java Add imports for binary doc values and stream utilities Refactor `parentFieldFetcher()` to use `ignoredValuesDocValuesFieldFetcher()` instead of stored field fallback Refactor `delegateFieldFetcher()` to simplify field name handling and use binary doc values fetcher Extract `getValuesFromDocValues()` helper method to decode doc values consistently Add `ignoredValuesDocValuesFieldFetcher()` to read ignored values from binary doc values field Add `CustomBinaryDocValues` inner class to wrap `BinaryDocValues` and expose `SortedBinaryDocValues` interface	+82/-28
BinaryDocValuesSyntheticFieldLoaderLayer.java `New layer for reading binary doc values in synthetic source` server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java New file implementing `CompositeSyntheticFieldLoader.DocValuesLayer` interface Reads binary doc values from a specified field name during synthetic source generation Decodes binary array format: `[doc value count][length of value` `1][value 1]...` Provides `DocValuesLoader` to iterate through documents and extract values Implements `write()` method to output decoded values to `XContentBuilder`	+85/-0
KeywordFieldMapper.java `Store ignored values in binary doc values field` server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Add imports for `ElasticsearchException` and `BytesStreamOutput` Add import for `LinkedHashSet` to maintain insertion order Modify `indexValue()` to store ignored values in `MultiValuedBinaryDocValuesField` instead of `StoredField` Replace `StoredFieldLayer` with `BinaryDocValuesSyntheticFieldLoaderLayer` in synthetic source support Add `MultiValuedBinaryDocValuesField` inner class that encodes multiple values as single binary array Implement `binaryValue()` to serialize values with count and length prefixes	+59/-9
NativeArrayIntegrationTestCase.java `Collapse single-element arrays to scalars` test/framework/src/main/java/org/elasticsearch/index/mapper/NativeArrayIntegrationTestCase.java Modify `arrayToSource()` to collapse single-element arrays into scalar fields Skip field synthesis if single element is null Maintain array format for multi-element or null-only arrays	+10/-4
WildcardFieldMapper.java `Use shared binary doc values layer implementation` x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java Remove unused imports for `BinaryDocValues`, `LeafReader`, and `ByteArrayStreamInput` Add import for `BinaryDocValuesSyntheticFieldLoaderLayer` Replace `WildcardSyntheticFieldLoader` with `BinaryDocValuesSyntheticFieldLoaderLayer` Remove `WildcardSyntheticFieldLoader` inner class implementation	+2/-54

Tests

KeywordSyntheticSourceNativeArrayIntegrationTests.java `Update test for deduplicated ignored values` server/src/test/java/org/elasticsearch/index/mapper/KeywordSyntheticSourceNativeArrayIntegrationTests.java Add blank line after mapping definition for readability Update test to expect deduplicated values in synthetic source output Add `expectedArrayValues` array showing deduplicated results Change test call to pass both actual and expected arrays	+11/-1
10_basic.yml `Update test data for deduplication behavior` modules/mapper-extras/src/yamlRestTest/resources/rest-api-spec/test/match_only_text/10_basic.yml Add duplicate value `"Apache Lucene"` to test input array Update expected synthetic source output to include the duplicate value	+2/-2

…r function

…nore-above

qodo-code-review · 2025-12-04T20:37:48Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	Denial of service Description: Unbounded accumulation of unique binary values in `MultiValuedBinaryDocValuesField` allows a single document to store an arbitrary number/size of ignored keyword values, potentially causing excessive memory usage or large doc values blobs leading to DoS during indexing or synthetic source generation; there are no limits or guards on count, total byte size, or duplicate handling beyond a LinkedHashSet. KeywordFieldMapper.java [1349-1377] Referred Code public void add(final BytesRef value) { // might as well track these on the go as opposed to having to loop through all entries later docValuesByteCount += value.length; uniqueValues.add(value); } /** * Encodes the collection of binary doc values as a single contiguous binary array, wrapped in {@link BytesRef}. This array takes * the form of [doc value count][length of value 1][value 1][length of value 2][value 2]... / @Override public BytesRef binaryValue() { int docValuesCount = uniqueValues.size(); // the + 1 is for the total doc values count, which is prefixed at the start of the array int streamSize = docValuesByteCount + docValuesCount Integer.BYTES; try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) { out.writeVInt(docValuesCount); for (BytesRef value : uniqueValues) { int valueLength = value.length; ... (clipped 8 lines)
	Untrusted data parsing Description: The decoding loop reads `valueCount` items from a doc values buffer without validating lengths beyond `readVInt()` which trusts on-disk data; if the stored array is corrupted or crafted, this can trigger excessive allocations or long loops when writing synthetic source (e.g., very large length values), risking OOM or prolonged CPU usage. BinaryDocValuesSyntheticFieldLoaderLayer.java [61-68] Referred Code @Override public void write(XContentBuilder b) throws IOException { for (int i = 0; i < valueCount; i++) { // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt() BytesRef valueBytes = stream.readBytesRef(); b.value(valueBytes.utf8ToString()); } }
	Untrusted data parsing Description: `CustomBinaryDocValues` decodes a single binary blob per document based solely on a leading count and subsequent lengths, without enforcing sane upper bounds for `docValueCount` or per-value length, making it susceptible to OOM or excessive processing if doc values are corrupted or maliciously large. MatchOnlyTextFieldMapper.java [810-834] Referred Code @Override public BytesRef nextValue() throws IOException { // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt() return stream.readBytesRef(); } @Override public boolean advanceExact(int docId) throws IOException { // if document has a value, read underlying bytes if (binaryDocValues.advanceExact(docId)) { BytesRef docValuesBytes = binaryDocValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); docValueCount = stream.readVInt(); return true; } // otherwise there is nothing to do docValueCount = 0; return false; } ... (clipped 4 lines)
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: No audit logs: The new code adds storage and decoding of ignored keyword values using binary doc values without emitting any audit logs for these critical data handling actions. Referred Code public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer { private final String fieldName; // the binary doc values for a document are all encoded in a single binary array, which this stream knows how to read // the doc values in the array take the form of [doc value count][length of value 1][value 1][length of value 2][value 2]... private final ByteArrayStreamInput stream; private int valueCount; public BinaryDocValuesSyntheticFieldLoaderLayer(String fieldName) { this.fieldName = fieldName; this.stream = new ByteArrayStreamInput(); } @Override public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException { BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName); // there are no values associated with this field if (docValues == null) { valueCount = 0; ... (clipped 45 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Edge cases unhandled: Doc values decoding assumes well-formed binary format and does not validate lengths or handle corrupted inputs beyond returning false, which may cause silent failures. Referred Code @Override public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException { BinaryDocValues docValues = leafReader.getBinaryDocValues(fieldName); // there are no values associated with this field if (docValues == null) { valueCount = 0; return null; } return docId -> { // there are no more documents to process if (docValues.advanceExact(docId) == false) { valueCount = 0; return false; } // otherwise, extract the doc values into a stream to later read from BytesRef docValuesBytes = docValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); valueCount = stream.readVInt(); ... (clipped 14 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Binary decode trust: The decoding of binary doc values trusts index data without bounds checks on lengths read from the stream, which could risk OOM or corrupted reads if data is malformed. Referred Code valueCount = stream.readVInt(); stream.readVInt(); // skip first value length return hasValue(); }; } @Override public void write(XContentBuilder b) throws IOException { for (int i = 0; i < valueCount; i++) { // this function already knows how to decode the underlying bytes array, so no need to explicitly call VInt() BytesRef valueBytes = stream.readBytesRef(); b.value(valueBytes.utf8ToString()); } } Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-04T20:38:50Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
High-level	The custom binary format is flawed The reader for the new custom binary format, `BinaryDocValuesSyntheticFieldLoaderLayer`, contains a bug where it incorrectly skips the length of the first value. This causes deserialization to fail and must be fixed. Examples: server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java [55] stream.readVInt(); // skip first value length server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1366-1373] try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) { out.writeVInt(docValuesCount); for (BytesRef value : uniqueValues) { int valueLength = value.length; out.writeVInt(valueLength); out.writeBytes(value.bytes, value.offset, valueLength); } return out.bytes().toBytesRef(); Solution Walkthrough: Before: // server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException { // ... return docId -> { // ... BytesRef docValuesBytes = docValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); valueCount = stream.readVInt(); stream.readVInt(); // BUG: skip first value length return hasValue(); }; } After: // server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java public DocValuesLoader docValuesLoader(LeafReader leafReader, int[] docIdsInLeaf) throws IOException { // ... return docId -> { // ... BytesRef docValuesBytes = docValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); valueCount = stream.readVInt(); // FIX: Do not skip the length of the first value. // The write() method's loop will correctly read each length-prefixed value. return hasValue(); }; } Suggestion importance[1-10]: 10 __ Why: This suggestion correctly identifies a critical bug in the new `BinaryDocValuesSyntheticFieldLoaderLayer` that causes data corruption during deserialization, rendering the synthetic source feature unusable for fields that rely on it.	High
Possible issue	Fix incorrect stream position handling Remove the incorrect call to `stream.readVInt()` in `docValuesLoader`. This call prematurely advances the stream pointer, which would cause data corruption when reading values in the `write` method. server/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java [52-57] BytesRef docValuesBytes = docValues.binaryValue(); stream.reset(docValuesBytes.bytes, docValuesBytes.offset, docValuesBytes.length); valueCount = stream.readVInt(); -stream.readVInt(); // skip first value length return hasValue(); Apply / Chat Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies a bug where `stream.readVInt()` is called unnecessarily, which would cause `stream.readBytesRef()` in the `write` method to fail by misinterpreting value bytes as a length prefix. This is a critical bug fix.	High
Possible issue	Fix potential buffer overflow issue In `binaryValue`, instantiate `BytesStreamOutput` without a pre-allocated size to prevent potential buffer overflows. The current size calculation for `VInt` is incorrect and could lead to an `ArrayIndexOutOfBoundsException`. server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java [1361-1377] public BytesRef binaryValue() { - int docValuesCount = uniqueValues.size(); - // the + 1 is for the total doc values count, which is prefixed at the start of the array - int streamSize = docValuesByteCount + docValuesCount * Integer.BYTES; - - try (BytesStreamOutput out = new BytesStreamOutput(streamSize)) { - out.writeVInt(docValuesCount); + try (BytesStreamOutput out = new BytesStreamOutput()) { + out.writeVInt(uniqueValues.size()); for (BytesRef value : uniqueValues) { - int valueLength = value.length; - out.writeVInt(valueLength); - out.writeBytes(value.bytes, value.offset, valueLength); + out.writeVInt(value.length); + out.writeBytes(value.bytes, value.offset, value.length); } return out.bytes().toBytesRef(); } catch (IOException e) { throw new ElasticsearchException("Failed to get binary value", e); } } Apply / Chat Suggestion importance[1-10]: 7 __ Why: The suggestion correctly points out that pre-allocating the `BytesStreamOutput` with a fixed size based on `Integer.BYTES` for `VInt` lengths is a potential bug. Using a dynamically-sized stream is more robust and prevents potential `ArrayIndexOutOfBoundsException`.	Medium
More

Kubik42 and others added 9 commits November 10, 2025 09:02

Store keyword fields that trip ignore_above in binary doc values

5a48232

Addressed feedback

9d13387

Moved ignore values doc value field fetcher inside of existing fetche…

c2ee79e

…r function

Moved ignore values doc value field fetcher inside of existing fetche…

9bac251

…r function

Moved ignore values doc value field fetcher inside of existing fetche…

ab28946

…r function

Moved ignore values doc value field fetcher inside of existing fetche…

a1d1d2a

…r function

Merge branch 'main' into store-keyword-fields-in-binary-doc-values-ig…

ca174e6

…nore-above

Merge branch 'main' into store-keyword-fields-in-binary-doc-values-ig…

0620697

…nore-above

Apply changes for benchmark PR

ad95654

qodo-code-review bot added the Review effort 3/5 label Dec 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CORRUPTED] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #28

[CORRUPTED] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #28

Uh oh!

tomerqodo commented Dec 4, 2025 •

edited by qodo-code-review bot

Loading

Uh oh!

qodo-code-review bot commented Dec 4, 2025

Uh oh!

qodo-code-review bot commented Dec 4, 2025

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[CORRUPTED] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #28

Are you sure you want to change the base?

[CORRUPTED] Synthetic Benchmark PR #137483 - Store keyword fields that trip ignore_above in binary doc values #28

Uh oh!

Conversation

tomerqodo commented Dec 4, 2025 • edited by qodo-code-review bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Benchmark PR elastic#137483

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-code-review bot commented Dec 4, 2025

PR Compliance Guide 🔍

Uh oh!

qodo-code-review bot commented Dec 4, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tomerqodo commented Dec 4, 2025 •

edited by qodo-code-review bot

Loading