Partition rate query using tsid prefixes #144818
```java
List<List<PartialLeafReaderContext>> partition(List<LeafReaderContext> leaves, int docsPerSlice) throws IOException {
```
Pinging @elastic/es-storage-engine (Team:StorageEngine)
|
@kkrik-es I think there is a bug in the partition-combining logic that can drop some slices. I didn't figure it out for a while (the tests didn't catch it). I think the win should be much smaller (and more realistic). I am running the benchmark again.
|
Buildkite benchmark this with tsdb-metricsgen-270m please
💚 Build Succeeded
This build ran two tsdb-metricsgen-270m benchmarks to evaluate the performance impact of this PR.
```java
out.writeByte(id);
byte val = id;
if (this == TIME_SERIES && out.getTransportVersion().supports(TIME_SERIES_PARTITIONING) == false) {
    val = DOC.id; // make time-series as DOC
```

Suggested change:

```java
    val = DOC.id; // fall back to DOC partitioning strategy for time-series
```
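The fallback under discussion can be illustrated with a small self-contained sketch. This is hypothetical code, not the PR's actual enum (the byte ids and the `peerSupportsTimeSeriesPartitioning` flag are stand-ins for `out.getTransportVersion().supports(TIME_SERIES_PARTITIONING)`): when the receiving node is too old to know the time-series strategy, it is serialized as `DOC` so the old node can still deserialize the value.

```java
// Hypothetical sketch of the wire-compatibility fallback; the enum ids and the
// boolean parameter are illustrative stand-ins, not the actual Elasticsearch code.
public class PartitioningWireFallback {
    enum Strategy {
        DOC((byte) 0), SEGMENT((byte) 1), TIME_SERIES((byte) 2);

        final byte id;

        Strategy(byte id) {
            this.id = id;
        }

        // Pick the byte to put on the wire based on what the peer supports.
        byte wireId(boolean peerSupportsTimeSeriesPartitioning) {
            if (this == TIME_SERIES && peerSupportsTimeSeriesPartitioning == false) {
                return DOC.id; // fall back to DOC partitioning strategy for time-series
            }
            return id;
        }
    }

    public static void main(String[] args) {
        // An old peer sees TIME_SERIES degraded to DOC; a new peer gets it unchanged.
        System.out.println(Strategy.TIME_SERIES.wireId(false)); // 0 (DOC)
        System.out.println(Strategy.TIME_SERIES.wireId(true));  // 2 (TIME_SERIES)
    }
}
```

The key property is that the fallback is one-directional and lossless for old nodes: they read a valid `DOC` id and simply use the pre-existing behavior.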
```java
final Map<Integer, PrefixGroup> groups = new TreeMap<>(); // ordered by prefixes
PartitionedDocValues.PrefixPartitions prefixPartitions = null;
for (LeafReaderContext leaf : leaves) {
    var tsid = leaf.reader().getSortedDocValues(TimeSeriesIdFieldMapper.NAME);
```

Super nit: rename to `tsidValues`?
```java
    return combineGroups(groups.values().stream().toList(), docsPerSlice);
}

private List<List<PartialLeafReaderContext>> combineGroups(List<PrefixGroup> groups, int docsPerSlice) {
```

Nit: let's add a comment outlining what this does. IIUC it combines groups to create chunkier slices so that they can be assigned to separate threads and be processed in parallel efficiently.
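For illustration, the behavior the comment should describe could look roughly like the sketch below. This is a hypothetical simplification (a `Group` here is just a document count, standing in for `PrefixGroup`, and `combine` stands in for `combineGroups`), not the PR's implementation: adjacent prefix groups are merged greedily until a slice reaches roughly `docsPerSlice` documents. Note the trailing-partial-slice handling at the end; forgetting it is exactly the kind of spot where slices could get dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineGroupsSketch {
    // Hypothetical stand-in for PrefixGroup: only the document count matters here.
    record Group(int docCount) {}

    // Greedily merge adjacent groups (they are ordered by prefix, so merging
    // neighbors keeps each slice a contiguous tsid range) until a slice holds
    // at least docsPerSlice documents, then start a new slice.
    static List<List<Group>> combine(List<Group> groups, int docsPerSlice) {
        List<List<Group>> slices = new ArrayList<>();
        List<Group> current = new ArrayList<>();
        int docsInCurrent = 0;
        for (Group g : groups) {
            current.add(g);
            docsInCurrent += g.docCount();
            if (docsInCurrent >= docsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (current.isEmpty() == false) {
            slices.add(current); // don't drop the trailing partial slice
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Group> groups = List.of(new Group(300), new Group(250), new Group(600), new Group(100));
        // Target 500 docs per slice: [300+250], [600], and the trailing [100].
        System.out.println(combine(groups, 500).size()); // 3
    }
}
```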
(Resolved review thread on ...ugin/esql/compute/src/main/java/org/elasticsearch/compute/lucene/query/LuceneSliceQueue.java)
```diff
 .put("mode", "time_series")
 .putList("routing_path", List.of("host", "cluster"))
-.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 3)
+.put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
```

Shall we use a random(1, 3) here? We can assert on the partitioning strategy only when this is 1.
|
Hm, results show very modest wins... Did the change apply?
This change wires the prefix partitions introduced in #144617 to the compute engine.
Today, we partition the rate query by interval by replacing round_to with query_and_tags. With 10k time-series and a 5-minute bucket, each interval query reads all 10k time-series from every segment. In the rate aggregation, we buffer data points for all 10k time-series and maintain a priority queue across all of them within each interval. This approach increases concurrency to avoid underutilizing CPUs, but it adds overhead and is not I/O friendly due to fragmented reads.
With prefix partitions, we partition data by groups of contiguous time-series instead. For example, 10k time-series can be split into 1024 groups of ~10 each. Each group reads all matching data points, and because these time-series are co-located in each segment, reads are sequential and I/O friendly. In the rate aggregation, the priority queue manages only ~10 time-series per group instead of 10k, significantly reducing overhead and memory usage. To avoid excessive overhead from tiny partitions, we merge adjacent partitions up to a target size (500k docs).
When prefix partitioning is not available (e.g., older codec without prefix layout), we fall back to the current behavior.
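The numbers in the description can be sanity-checked with some back-of-the-envelope arithmetic (the figures below are illustrative, derived only from the 10k-series / 1024-group example above):

```java
// Illustrative arithmetic for the prefix-partitioning example above:
// 10k time-series split into 1024 prefix groups, and the relative cost of a
// binary-heap priority queue over each population (~log2(n) comparisons per op).
public class PrefixPartitionArithmetic {
    public static void main(String[] args) {
        int totalSeries = 10_000;
        int prefixGroups = 1_024;

        double seriesPerGroup = (double) totalSeries / prefixGroups;
        System.out.printf("series per group: %.1f%n", seriesPerGroup); // ~9.8

        // One queue over all series vs. one small queue per group.
        double costAllSeries = Math.log(totalSeries) / Math.log(2);    // ~13.3
        double costPerGroup = Math.log(seriesPerGroup) / Math.log(2);  // ~3.3
        System.out.printf("heap comparisons per op: %.1f vs %.1f%n", costAllSeries, costPerGroup);
    }
}
```

So each group's queue does roughly a quarter of the per-operation heap work, on top of the memory savings from buffering ~10 series at a time instead of 10k.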