
Partition rate query using tsid prefixes #144818

Open

dnhatn wants to merge 3 commits into elastic:main from dnhatn:query-tsid-prefix-partitions

Conversation

@dnhatn (Member) commented Mar 24, 2026

This change wires the prefix partitions introduced in #144617 to the compute engine.

Today, we partition the rate query by interval, by replacing round_to with query_and_tags. With 10k time-series and a 5-minute bucket, each interval query reads all 10k time-series from every segment. In the rate aggregation, we buffer data points for all 10k time-series and maintain a priority queue across all of them within each interval. This approach increases concurrency to avoid underutilizing CPUs, but it adds overhead and is not I/O friendly due to fragmented reads.

With prefix partitions, we partition data by groups of contiguous time-series instead. For example, 10k time-series can be split into 1024 groups of ~10 each. Each group reads all matching data points, and because these time-series are co-located in each segment, reads are sequential and I/O friendly. In the rate aggregation, the priority queue manages only ~10 time-series per group instead of 10k, significantly reducing overhead and memory usage. To avoid excessive overhead from tiny partitions, we merge adjacent partitions up to a target size (500k docs).
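The merging step described above can be sketched as a simple greedy pass over prefix-ordered groups. This is an illustrative standalone sketch, not the PR's actual code: `PrefixGroup`, its fields, and `combineGroups` here are hypothetical stand-ins for the real classes, and the target size is the 500k-doc figure from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixGroupMerger {
    // Hypothetical stand-in for a group of contiguous time-series sharing a tsid prefix.
    record PrefixGroup(int prefix, long docCount) {}

    // Greedily pack prefix-ordered groups into slices of up to docsPerSlice docs,
    // so tiny partitions are merged into chunkier units of parallel work.
    static List<List<PrefixGroup>> combineGroups(List<PrefixGroup> groups, long docsPerSlice) {
        List<List<PrefixGroup>> slices = new ArrayList<>();
        List<PrefixGroup> current = new ArrayList<>();
        long currentDocs = 0;
        for (PrefixGroup group : groups) { // groups are ordered by prefix
            if (currentDocs > 0 && currentDocs + group.docCount() > docsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                currentDocs = 0;
            }
            current.add(group);
            currentDocs += group.docCount();
        }
        if (current.isEmpty() == false) {
            slices.add(current); // never drop the trailing partial slice
        }
        return slices;
    }

    public static void main(String[] args) {
        List<PrefixGroup> groups = List.of(
            new PrefixGroup(0, 200_000),
            new PrefixGroup(1, 200_000),
            new PrefixGroup(2, 200_000),
            new PrefixGroup(3, 50_000)
        );
        // 500k-doc target: groups {0,1} fit in one slice, {2,3} in another.
        System.out.println(combineGroups(groups, 500_000).size()); // prints 2
    }
}
```

Keeping groups in prefix order before packing matters: adjacent prefixes are co-located on disk, so a merged slice still reads sequentially.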

When prefix partitioning is not available (e.g., older codec without prefix layout), we fall back to the current behavior.

@dnhatn dnhatn added :StorageEngine/ES|QL Timeseries / metrics / PromQL / logsdb capabilities in ES|QL >non-issue labels Mar 24, 2026
@dnhatn dnhatn requested a review from kkrik-es March 24, 2026 03:56
}
}

List<List<PartialLeafReaderContext>> partition(List<LeafReaderContext> leaves, int docsPerSlice) throws IOException {
@dnhatn (Member, Author)
This is the main change.

@dnhatn dnhatn marked this pull request as ready for review March 24, 2026 03:58
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elastic elastic deleted a comment from elasticmachine Mar 24, 2026
@dnhatn (Member, Author) commented Mar 24, 2026

@kkrik-es I think there is a bug in combining partitions that can drop some slices; I didn't figure it out for a while (tests didn't catch it). I think the win should be much smaller (and more realistic). I am running the benchmark again.

@dnhatn (Member, Author) commented Mar 24, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

@elasticmachine (Collaborator) commented Mar 24, 2026

💚 Build Succeeded

This build ran two tsdb-metricsgen-270m benchmarks to evaluate performance impact of this PR.


out.writeByte(id);
byte val = id;
if (this == TIME_SERIES && out.getTransportVersion().supports(TIME_SERIES_PARTITIONING) == false) {
val = DOC.id; // make time-series as DOC
@kkrik-es (Contributor) commented Mar 24, 2026

Suggested change:
- val = DOC.id; // make time-series as DOC
+ val = DOC.id; // fall back to DOC partitioning strategy for time-series
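The fallback in the snippet above is a common wire-compatibility idiom: if the receiver's transport version predates the new strategy, serialize it as an id the older node understands. This is a simplified standalone sketch of that pattern, not the actual Elasticsearch classes; the enum constants, their ids, and `wireId` are illustrative stand-ins.

```java
public class PartitioningBwc {
    // Hypothetical partitioning strategies with wire ids; only the fallback logic
    // mirrors the snippet under review.
    enum Strategy {
        DOC((byte) 0), SEGMENT((byte) 1), TIME_SERIES((byte) 2);

        final byte id;

        Strategy(byte id) {
            this.id = id;
        }

        // Older receivers don't know TIME_SERIES, so send DOC's id instead.
        byte wireId(boolean receiverSupportsTimeSeries) {
            if (this == TIME_SERIES && receiverSupportsTimeSeries == false) {
                return DOC.id; // fall back to DOC partitioning strategy for time-series
            }
            return id;
        }
    }

    public static void main(String[] args) {
        System.out.println(Strategy.TIME_SERIES.wireId(false)); // prints 0
        System.out.println(Strategy.TIME_SERIES.wireId(true));  // prints 2
        System.out.println(Strategy.DOC.wireId(false));         // prints 0
    }
}
```

Downgrading at serialization time, rather than rejecting the request, lets mixed-version clusters keep working during a rolling upgrade at the cost of losing the optimization on old nodes.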

final Map<Integer, PrefixGroup> groups = new TreeMap<>(); // ordered by prefixes
PartitionedDocValues.PrefixPartitions prefixPartitions = null;
for (LeafReaderContext leaf : leaves) {
var tsid = leaf.reader().getSortedDocValues(TimeSeriesIdFieldMapper.NAME);
(Contributor)

Super nit: rename to tsidValues?

return combineGroups(groups.values().stream().toList(), docsPerSlice);
}

private List<List<PartialLeafReaderContext>> combineGroups(List<PrefixGroup> groups, int docsPerSlice) {
(Contributor)

Nit: let's add a comment outlining what this does. If I understand correctly, it combines groups to create chunkier slices so that they can be assigned to separate threads and processed in parallel efficiently.

.put("mode", "time_series")
.putList("routing_path", List.of("host", "cluster"))
.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 3)
.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
(Contributor)

Shall we use a random(1,3) here? We can assert on the partitioning strategy only when this is 1.

@kkrik-es (Contributor) left a comment

Well done, Nhat!

@kkrik-es (Contributor)

Hm, the results show very modest wins... Did the change apply?


Labels

>non-issue :StorageEngine/ES|QL Timeseries / metrics / PromQL / logsdb capabilities in ES|QL Team:StorageEngine v9.4.0
