@martijnvg

WIP. Note: no tests yet; this explores how to implement bulk loading of dense singleton sorted (set) doc values.

This change targets bulk reading of field values at the codec level when three conditions hold: the doc values type is sorted or sorted set, there is only one value per document, and the field is dense (all documents have a value).

Relates to #128445
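
The three conditions in the description can be written as a small predicate. This is only an illustrative sketch: the class name, method name, and parameters are hypothetical and not taken from the PR. It assumes per-field counts of documents with a value and of total values are available.

```java
public final class DenseSingletonCheck {

    /**
     * Returns true when a field is dense (every document has a value)
     * and singleton (exactly one value per document that has one).
     *
     * @param maxDoc        number of documents in the segment
     * @param docsWithField number of documents that have at least one value
     * @param valueCount    total number of values across all documents
     */
    static boolean isDenseSingleton(int maxDoc, int docsWithField, long valueCount) {
        boolean dense = docsWithField == maxDoc;
        boolean singleton = valueCount == docsWithField;
        return dense && singleton;
    }

    public static void main(String[] args) {
        System.out.println(isDenseSingleton(100, 100, 100L)); // dense and singleton
        System.out.println(isDenseSingleton(100, 80, 80L));   // sparse
        System.out.println(isDenseSingleton(100, 100, 130L)); // multi-valued
    }
}
```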

    minOrd = Math.min(minOrd, ord);
    maxOrd = Math.max(maxOrd, ord);
}
builder.appendOrds(convertedOrds, 0, length, minOrd, maxOrd);

The builder is backed by an int[], so we need to convert here because TSDBDocValuesEncoder is long[] based. Is there a better way? For example, could we have a builder that accepts long[] and then does the conversion to int in build()?

I think we need to do this conversion somewhere. Maybe here is good enough.
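
For readers following along, the conversion discussed in this thread might look roughly like the sketch below. It narrows a long[] block of ordinals into an int[] while tracking the minimum and maximum ordinal, mirroring the minOrd/maxOrd lines in the diff above. The class, method, and holder type are hypothetical, and the real builder API is not reproduced here.

```java
public final class OrdinalConversion {

    /** Hypothetical holder for the narrowed ordinals and the observed range. */
    static final class ConvertedOrds {
        final int[] ords;
        final int minOrd;
        final int maxOrd;

        ConvertedOrds(int[] ords, int minOrd, int maxOrd) {
            this.ords = ords;
            this.minOrd = minOrd;
            this.maxOrd = maxOrd;
        }
    }

    /** Narrows the first {@code length} entries of a long[] block to int, tracking min and max. */
    static ConvertedOrds convert(long[] block, int length) {
        if (length <= 0 || length > block.length) {
            throw new IllegalArgumentException("invalid length: " + length);
        }
        int[] converted = new int[length];
        int minOrd = Integer.MAX_VALUE;
        int maxOrd = Integer.MIN_VALUE;
        for (int i = 0; i < length; i++) {
            // toIntExact throws ArithmeticException instead of silently truncating
            int ord = Math.toIntExact(block[i]);
            converted[i] = ord;
            minOrd = Math.min(minOrd, ord);
            maxOrd = Math.max(maxOrd, ord);
        }
        return new ConvertedOrds(converted, minOrd, maxOrd);
    }

    public static void main(String[] args) {
        ConvertedOrds result = convert(new long[] {7L, 2L, 9L, 4L}, 4);
        System.out.println(result.minOrd + ".." + result.maxOrd);
    }
}
```

Using Math.toIntExact is a choice made for this sketch; a plain cast would be cheaper if ordinals are already known to fit in an int.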

    valuesData.seek(indexReader.get(blockIndex));
}
currentBlockIndex = blockIndex;
decoder.decodeOrdinals(valuesData, currentBlock, bitsPerOrd);

This method is largely the same as the read(...) method. It differs only in the decode call (decoder.decodeOrdinals(valuesData, currentBlock, bitsPerOrd);) and at the end, because the builder is different.

Ideally I would like to see less code duplication here.
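
One possible way to reduce this kind of duplication, sketched under heavy assumptions: keep a single block-walking loop and pass in the two parts that differ, the decode step and the sink that receives values. Everything below (class name, BlockDecoder interface, the toy "shifted" encoding, the block size of 4) is hypothetical and not the approach taken in the PR.

```java
import java.util.function.LongConsumer;

public final class SharedBlockLoop {

    static final int BLOCK_SIZE = 4; // toy block size for the example

    /** The step that differs between the two read paths. */
    interface BlockDecoder {
        void decode(long[] encoded, int blockIndex, long[] target);
    }

    /** Shared loop: decodes a block only when the requested doc moves into a new block. */
    static void readDocs(long[] encoded, int[] docs, BlockDecoder decoder, LongConsumer sink) {
        long[] currentBlock = new long[BLOCK_SIZE];
        int currentBlockIndex = -1;
        for (int doc : docs) {
            int blockIndex = doc / BLOCK_SIZE;
            if (blockIndex != currentBlockIndex) {
                decoder.decode(encoded, blockIndex, currentBlock);
                currentBlockIndex = blockIndex;
            }
            sink.accept(currentBlock[doc % BLOCK_SIZE]);
        }
    }

    /** Stand-in for the numeric path: copies the block as-is. */
    static void decodePlain(long[] encoded, int blockIndex, long[] target) {
        int start = blockIndex * BLOCK_SIZE;
        int count = Math.min(BLOCK_SIZE, encoded.length - start);
        System.arraycopy(encoded, start, target, 0, count);
    }

    /** Stand-in for the ordinal path: the toy encoding stores each value shifted by one. */
    static void decodeShifted(long[] encoded, int blockIndex, long[] target) {
        int start = blockIndex * BLOCK_SIZE;
        int count = Math.min(BLOCK_SIZE, encoded.length - start);
        for (int i = 0; i < count; i++) {
            target[i] = encoded[start + i] - 1;
        }
    }

    public static void main(String[] args) {
        long[] encoded = {10, 11, 12, 13, 14, 15, 16, 17};
        int[] docs = {1, 2, 5};
        readDocs(encoded, docs, SharedBlockLoop::decodePlain, v -> System.out.println("plain " + v));
        readDocs(encoded, docs, SharedBlockLoop::decodeShifted, v -> System.out.println("shifted " + v));
    }
}
```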

}

@Override
public BlockLoader.Block read(BlockLoader.BlockFactory factory, BlockLoader.Docs docs, int offset) throws IOException {

Ordinals are implemented in the codec as numeric doc values. The builder is created here because we need to reference a SortedDocValues instance.

@martijnvg martijnvg force-pushed the compute_engine_improve_loading_dense_sorted_doc_values branch from c995f5e to 1a0e927 Compare August 12, 2025 09:55

@Override
public boolean supportsBlockRead() {
    return ords instanceof BulkNumericDocValues;

Depending on whether the field is empty, has a single unique value, is sparse, or is dense, we get a different implementation here. Only the dense implementation supports bulk loading, which is why this supportsBlockRead() method exists.
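
The capability check described in this comment can be illustrated with a small standalone sketch. The interfaces and classes below are hypothetical stand-ins, not the real BulkNumericDocValues API: only the dense implementation exposes a bulk read, and callers use an instanceof test to pick between the bulk path and the per-document fallback.

```java
public final class BlockReadDispatch {

    /** Hypothetical per-document access to ordinals. */
    interface OrdinalValues {
        long ordAt(int doc);
    }

    /** Hypothetical marker for implementations that can also read a range in one call. */
    interface BulkOrdinalValues extends OrdinalValues {
        void readRange(int firstDoc, long[] dest, int count);
    }

    /** Dense case: one ordinal per document, stored contiguously. */
    static final class DenseOrds implements BulkOrdinalValues {
        private final long[] ords;

        DenseOrds(long[] ords) {
            this.ords = ords;
        }

        @Override
        public long ordAt(int doc) {
            return ords[doc];
        }

        @Override
        public void readRange(int firstDoc, long[] dest, int count) {
            System.arraycopy(ords, firstDoc, dest, 0, count);
        }
    }

    /** Single-unique-value case: every document maps to ordinal 0, no bulk path. */
    static final class SingleValueOrds implements OrdinalValues {
        @Override
        public long ordAt(int doc) {
            return 0L;
        }
    }

    static boolean supportsBlockRead(OrdinalValues ords) {
        return ords instanceof BulkOrdinalValues;
    }

    /** Uses the bulk path when available, otherwise falls back to one call per document. */
    static long[] load(OrdinalValues ords, int firstDoc, int count) {
        long[] dest = new long[count];
        if (supportsBlockRead(ords)) {
            ((BulkOrdinalValues) ords).readRange(firstDoc, dest, count);
        } else {
            for (int i = 0; i < count; i++) {
                dest[i] = ords.ordAt(firstDoc + i);
            }
        }
        return dest;
    }

    public static void main(String[] args) {
        OrdinalValues dense = new DenseOrds(new long[] {3, 1, 4, 1, 5});
        OrdinalValues single = new SingleValueOrds();
        System.out.println(supportsBlockRead(dense) + " " + java.util.Arrays.toString(load(dense, 1, 3)));
        System.out.println(supportsBlockRead(single) + " " + java.util.Arrays.toString(load(single, 1, 3)));
    }
}
```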

@martijnvg martijnvg force-pushed the compute_engine_improve_loading_dense_sorted_doc_values branch from 1a0e927 to f7ff379 Compare August 14, 2025 12:03
@martijnvg

I've been benchmarking locally with the following query: FROM metrics-hostmetricsreceiver.otel-default | STATS count(*) BY host.name | LIMIT 10000. Without this change the total query time is ~260 ms, and with this change it is ~150 ms.

When running the following profile query:

POST /_query
{
    "profile": true,
    "query": "FROM metrics-hostmetricsreceiver.otel-default | STATS count(*) BY host.name | LIMIT 10000",
    "pragma": {
        "data_partitioning": "shard"
    }
}

Without this change host.name value loading takes:

{
    "operator": "ValuesSourceReaderOperator[fields = [host.name]]",
    "status": {
        "readers_built": {
            "host.name:column_at_a_time:BlockDocValuesReader.SingletonOrdinals": 58
        },
        "values_loaded": 221184000,
        "process_nanos": 1216242904, <-- ~1216 ms
        "pages_received": 45597,
        "pages_emitted": 45597,
        "rows_received": 221184000,
        "rows_emitted": 221184000
    }
}

With this change host.name value loading takes:

{
    "operator": "ValuesSourceReaderOperator[fields = [host.name]]",
    "status": {
        "readers_built": {
            "host.name:column_at_a_time:BlockDocValuesReader.SingletonOrdinals": 58
        },
        "values_loaded": 221184000,
        "process_nanos": 387031891, <-- ~387 ms
        "pages_received": 45597,
        "pages_emitted": 45597,
        "rows_received": 221184000,
        "rows_emitted": 221184000
    }
}
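
To put the two profiles side by side, the small sketch below derives the speedup and the per-value cost from the process_nanos and values_loaded figures quoted above. The class and method names are only for illustration; the input numbers are copied from the profile output.

```java
public final class ProfileMath {

    /** Ratio of the slower time to the faster time. */
    static double speedup(long beforeNanos, long afterNanos) {
        return (double) beforeNanos / afterNanos;
    }

    /** Average time spent in the operator per loaded value. */
    static double nanosPerValue(long processNanos, long valuesLoaded) {
        return (double) processNanos / valuesLoaded;
    }

    public static void main(String[] args) {
        long valuesLoaded = 221_184_000L;  // values_loaded in both profiles
        long before = 1_216_242_904L;      // process_nanos without the change
        long after = 387_031_891L;         // process_nanos with the change

        System.out.printf("speedup: %.2fx%n", speedup(before, after));
        System.out.printf("before: %.2f ns/value%n", nanosPerValue(before, valuesLoaded));
        System.out.printf("after: %.2f ns/value%n", nanosPerValue(after, valuesLoaded));
    }
}
```

With these inputs the value-loading operator comes out roughly 3.1x faster, at about 5.5 ns per value before and about 1.75 ns per value after.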

@martijnvg martijnvg closed this Aug 15, 2025
martijnvg added a commit that referenced this pull request Aug 16, 2025
With this change both sorted set and numeric doc values use the same bulk loading for values/ordinals.

This PR supersedes (#132715) that also sped up loading dense singleton keyword fields, but duplicated the bulk encoding logic.