feat: support `SELECT DISTINCT id FROM t ORDER BY id LIMIT n` query use GroupedTopKAggregateStream #19653

haohuaijin · 2026-01-05T15:53:04Z

Which issue does this PR close?

Closes #19638

Rationale for this change

See issue #19638.

What changes are included in this PR?

Introduced a LimitOptions struct for the limit field, carrying both the limit and an optional descending ordering direction.
Extended the TopKAggregation optimizer rule to DISTINCT queries: it now recognizes GROUP BY queries without aggregates and sets the descending flag from the ordering direction.
Enhanced GroupedTopKAggregateStream to handle DISTINCT by using the group key as both the priority queue key and its value.
Updated the proto definitions to add an optional descending field to the AggLimit message for serialization and deserialization.
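
To make the shape of the first two changes concrete, here is a minimal sketch, not the code from this PR. The field names, the builder method, and the `is_distinct_like` helper are assumptions chosen for illustration; the actual definitions in DataFusion may differ.

```rust
/// Hypothetical sketch of a limit option that also carries a sort direction.
/// Names are illustrative assumptions, not the definitions from this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LimitOptions {
    /// Maximum number of groups to emit.
    pub limit: usize,
    /// `Some(true)` for descending order, `Some(false)` for ascending,
    /// `None` when no ordering direction is attached to the limit.
    pub descending: Option<bool>,
}

impl LimitOptions {
    /// Create a limit with no ordering direction.
    pub fn new(limit: usize) -> Self {
        Self { limit, descending: None }
    }

    /// Attach an ordering direction to the limit.
    pub fn with_descending(mut self, descending: bool) -> Self {
        self.descending = Some(descending);
        self
    }
}

/// Hypothetical check: a GROUP BY with group keys but no aggregate
/// expressions behaves like SELECT DISTINCT on those keys.
pub fn is_distinct_like(num_group_exprs: usize, num_aggr_exprs: usize) -> bool {
    num_group_exprs > 0 && num_aggr_exprs == 0
}
```

Under these assumptions, a query such as `SELECT DISTINCT id FROM t ORDER BY id DESC LIMIT 10` would correspond to `LimitOptions::new(10).with_descending(true)`.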

benchmark result

Are these changes tested?

Yes, test cases were added in aggregates_topk.slt.

Are there any user-facing changes?

No.


haohuaijin · 2026-01-10T08:23:12Z

datafusion/physical-plan/src/aggregates/topk_stream.rs

+                        let mut cols = self.priority_map.emit()?;
+                        // For DISTINCT case (no aggregate expressions), only use the group key column
+                        // since the schema only has one field and key/value are the same
+                        if self.aggregate_arguments.is_empty() {
+                            cols.truncate(1);
+                        }


We can improve this part further. For the query

select distinct id from t order by id limit 10

there are no aggregates, so we only need to maintain the top-k heap and can skip tracking the group keys.
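
The idea in this comment can be sketched in isolation. The following is a simplified illustration, not the GroupedTopKAggregateStream implementation: it assumes `i64` keys and uses a `BTreeSet` as a stand-in for the priority queue, and the function name `distinct_top_k` is hypothetical. At most `k` distinct values are held at any time, with no separate group-key map.

```rust
use std::collections::BTreeSet;

/// Return the first `k` distinct values in the requested order while
/// holding at most `k` values in memory.
/// Illustrative sketch only; not the DataFusion implementation.
pub fn distinct_top_k(values: &[i64], k: usize, descending: bool) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    let mut kept: BTreeSet<i64> = BTreeSet::new();
    for &v in values {
        // Duplicates never change the result, so skip them early.
        if kept.contains(&v) {
            continue;
        }
        if kept.len() < k {
            kept.insert(v);
            continue;
        }
        // The set is full: the "worst" kept value is the largest for an
        // ascending query and the smallest for a descending one.
        let worst = if descending {
            kept.iter().next().copied()
        } else {
            kept.iter().next_back().copied()
        };
        if let Some(worst) = worst {
            let better = if descending { v > worst } else { v < worst };
            if better {
                kept.remove(&worst);
                kept.insert(v);
            }
        }
    }
    if descending {
        kept.into_iter().rev().collect()
    } else {
        kept.into_iter().collect()
    }
}
```

For example, `distinct_top_k(&[5, 3, 5, 1, 3, 9, 2], 3, false)` yields `[1, 2, 3]`, mirroring `SELECT DISTINCT id FROM t ORDER BY id LIMIT 3` over those values.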

kosiew

LGTM!

kosiew · 2026-01-14T08:08:20Z

datafusion/core/benches/topk_aggregate.rs

+    let batches = collect(plan, ctx.task_ctx()).await?;
+    assert_eq!(batches.len(), 1);
+    let batch = batches.first().unwrap();
+    assert_eq!(batch.num_rows(), 10);


`LIMIT` instead of `10`

haohuaijin · 2026-01-14T14:17:10Z

Thanks for your review, @kosiew. I've applied the suggestion.

feat: support SELECT DISTINCT id FROM t ORDER BY id LIMIT n query use GroupedTopKAggregateStream

af19d9f

github-actions bot added the logical-expr, optimizer, core, sqllogictest, proto, and physical-plan labels on Jan 5, 2026

haohuaijin added 2 commits January 5, 2026 23:53

Merge branch 'main' into topk-distinct

f8d8ac7

update

a790c43

github-actions bot removed the logical-expr label on Jan 5, 2026

haohuaijin added 5 commits January 7, 2026 14:20

Merge branch 'main' into topk-distinct

8862990

Merge branch 'main' into topk-distinct

b6dca88

Merge branch 'main' into topk-distinct

5fffc64

fix merge issue

c2e7c33

update

da76832


kosiew approved these changes Jan 14, 2026


haohuaijin added 2 commits January 14, 2026 22:16

apply suggestion

0ea86bc

Merge branch 'main' into topk-distinct

8fd0376
