
Conversation

@comphead
Contributor

Which issue does this PR close?

Just an experiment, related to #2292
Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@comphead comphead changed the title from "feat: do not fallback to Spark for distincts" to "feat: do not fallback to Spark for distinct aggregates" Sep 20, 2025
@comphead
Contributor Author

comphead commented Sep 20, 2025

Might be fixed by #2407
@andygrove WDYT, do you have anything in mind to double-test the count(distinct)?

Running CometFuzzAggregateSuite

@mbutrovich
Contributor

mbutrovich commented Sep 20, 2025

I think it was fixed by several changes: #2407 is one, but the native shuffle rewrite and bumping the Arrow Java version likely contributed as well. It's encouraging to see!

@comphead
Contributor Author

I'm adding more tests to make sure it is working now

      binding: Boolean,
      conf: SQLConf): Option[AggExpr] = {

    if (aggExpr.isDistinct) {
Member

Don't we need to pass the aggExpr.isDistinct value into the protobuf plan?

Contributor Author

This is a good point; I was thinking the same, but IMO Spark doesn't call count distinct in the partial phase.

                +-----------------------------+
                |           Driver            |
                |     COUNT(DISTINCT name)    |
                +--------------+--------------+
                               |
                               v
     +-------------------+          +------------------------+
     |    Executor 1     |          |      Executor 2        |
     | Partitions P0,P1  |          | Partitions P2,P3       |
     | Local distinct:   |          | Local distinct:        |
     | {Alice,Bob,Eve}   |          | {Mallory,Eve,Bob,Trent}|
     +---------+---------+          +-----------+------------+
               |                                |
               |            Shuffle             |
               v                                v
        +------------------+          +------------------+
        |    Reducer R0    |          |    Reducer R1    |
        | {Alice,Bob,Eve}  |          | {Mallory,Trent}  |
        +------------------+          +------------------+
                 \                             /
                  \                           /
                   \                         /
                    +-----------+-----------+
                                v
                      Driver Final Merge
                        DISTINCT = 5

The local distinct is done by a HashAggregate, so when count distinct is called as an aggExpr it might not need the flag, since the data has already been deduped on the reducers. Checking the Final stage, though.
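
As a minimal illustration (assuming a plain spark-shell session, where spark and its implicits are in scope, and a throwaway view name), the two-phase shape can be observed directly:

  // Sketch: inspect Spark's two-phase plan for a single COUNT(DISTINCT).
  // "people" is a hypothetical view used only for illustration.
  val people = Seq("Alice", "Bob", "Eve", "Mallory", "Eve", "Bob", "Trent").toDF("name")
  people.createOrReplaceTempView("people")

  // The physical plan shows HashAggregate(keys=[name]) stages that dedupe the
  // values locally and again after the shuffle; only then do the
  // partial_count(distinct name) / count(distinct name) stages run.
  spark.sql("SELECT count(DISTINCT name) FROM people").explain()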

Member

Thanks for explaining that. We should add tests for other distinct aggregates as well, such as sum and avg. I'm not sure if there are others?

Member

Spark has tests for the following aggregates with DISTINCT (a sketch follows the list):

  • count
  • sum
  • avg
  • first
  • last
  • corr
  • var_pop
  • var_samp
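
For reference, a sketch exercising a few of those forms in one query, against a hypothetical view t(k, v); count, sum, and avg are shown here, and per the list above first, last, corr, var_pop, and var_samp accept DISTINCT as well:

  // Sketch: several DISTINCT aggregates in one query, against a
  // hypothetical view t(k, v).
  val sql =
    """SELECT k,
      |       count(DISTINCT v),
      |       sum(DISTINCT v),
      |       avg(DISTINCT v)
      |FROM t
      |GROUP BY k""".stripMargin
  spark.sql(sql).show()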

Member

For this PR, we could just remove the fallback for COUNT?

Contributor Author

Yep, will do.

@codecov-commenter

codecov-commenter commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 58.48%. Comparing base (f09f8af) to head (9d3a40f).
⚠️ Report is 542 commits behind head on main.

Files with missing lines                                Patch %   Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala   75.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2429      +/-   ##
============================================
+ Coverage     56.12%   58.48%   +2.35%     
- Complexity      976     1440     +464     
============================================
  Files           119      146      +27     
  Lines         11743    13519    +1776     
  Branches       2251     2352     +101     
============================================
+ Hits           6591     7906    +1315     
- Misses         4012     4379     +367     
- Partials       1140     1234      +94     


@comphead
Contributor Author

depends on #2258

val df = spark.read.parquet(filename)
df.createOrReplaceTempView("t1")
for (col <- df.columns) {
  val sql = s"SELECT count(distinct $col) FROM t1"
Member

Could you also add tests for count distinct with multiple columns, e.g. COUNT(DISTINCT col1, col2, col3)?

Contributor Author

Thanks @andygrove, I'll add them separately as well.
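
A sketch of what such a test could look like, following the single-column loop above (assumptions: the suite's checkSparkAnswerAndOperator helper is available and the parquet file has at least three columns):

  test("count distinct with multiple columns") {
    val df = spark.read.parquet(filename)
    df.createOrReplaceTempView("t1")
    // Assumption: the file has at least three columns to combine.
    val cols = df.columns.take(3).mkString(", ")
    val sql = s"SELECT count(distinct $cols) FROM t1"
    checkSparkAnswerAndOperator(sql)
  }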

@comphead
Contributor Author

Fuzz tests fall back to Spark

- count distinct (native_comet, native shuffle) *** FAILED *** (418 milliseconds)
  Expected only Comet native operators, but found HashAggregate.
  plan: HashAggregate(keys=[], functions=[count(distinct c0#92074)], output=[count(DISTINCT c0)#92169L])
  +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=716408]
     +- HashAggregate(keys=[], functions=[partial_count(distinct c0#92074)], output=[count#92173L])
        +- CometHashAggregate [c0#92074], [c0#92074]
           +- CometExchange hashpartitioning(c0#92074, 10), ENSURE_REQUIREMENTS, CometNativeShuffle, [plan_id=716405]
              +- CometHashAggregate [c0#92074], [c0#92074]
                 +- CometScan [native_comet] parquet [c0#92074] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)

checking

@comphead comphead changed the title from "feat: do not fallback to Spark for distinct aggregates" to "feat: do not fallback to Spark for COUNT(distinct)" Sep 24, 2025
@comphead comphead marked this pull request as ready for review September 24, 2025 17:54
@comphead comphead requested a review from andygrove September 24, 2025 18:45
@mbutrovich
Contributor

Can we add a case to CometAggregateBenchmark to tout our gains? :)

@comphead
Contributor Author

Can we add a case to CometAggregateBenchmark to tout our gains? :)

Added

@comphead comphead requested a review from mbutrovich September 24, 2025 23:13
Contributor

@mbutrovich mbutrovich left a comment

LGTM. I guess Comet already knew how to serde everything and it was just a matter of removing the fallback and beefing up tests. Thanks @comphead!

I also ran the benchmark locally: ~1.4x on most cases! 🚀

CometAggregateBenchmark.txt

@andygrove
Member

I found that we fall back to Spark when there are multiple count distinct expressions, e.g.

  test("mulitple count distinct with group column") {
    val df = spark.read.parquet(filename)
    df.createOrReplaceTempView("t1")
    val sql = s"SELECT c1, count(distinct c2), count(distinct c3) FROM t1 group by c1"
    val (_, cometPlan) = checkSparkAnswer(sql)
    if (usingDataSourceExec) {
      assert(1 == collectNativeScans(cometPlan).length)
    }
  }

This isn't an issue for this PR, but thought I should make a note of this.

Comet cannot accelerate HashAggregateExec because: Aggregate expression with filter is not supported
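
For context, I believe this comes from Spark's RewriteDistinctAggregates rule, which turns multiple distinct aggregates into a single aggregate over an Expand; the rewritten aggregate expressions can carry FILTER clauses, which is what the message points at. A minimal sketch of the same construct written explicitly, against the hypothetical t1 view from the test above:

  // Sketch: an aggregate expression with an explicit FILTER clause, the
  // construct the fallback message refers to (hypothetical view t1).
  val sql =
    """SELECT c1,
      |       count(c2) FILTER (WHERE c3 IS NOT NULL) AS cnt
      |FROM t1
      |GROUP BY c1""".stripMargin
  spark.sql(sql).explain()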

Member

@andygrove andygrove left a comment

LGTM. Thanks @comphead!

@comphead
Contributor Author

I found that we fall back to Spark when there are multiple count distinct expressions, e.g.

  test("mulitple count distinct with group column") {
    val df = spark.read.parquet(filename)
    df.createOrReplaceTempView("t1")
    val sql = s"SELECT c1, count(distinct c2), count(distinct c3) FROM t1 group by c1"
    val (_, cometPlan) = checkSparkAnswer(sql)
    if (usingDataSourceExec) {
      assert(1 == collectNativeScans(cometPlan).length)
    }
  }

This isn't an issue for this PR, but thought I should make a note of this.

Comet cannot accelerate HashAggregateExec because: Aggregate expression with filter is not supported

Filed #2456

@comphead comphead merged commit 5845227 into apache:main Sep 25, 2025
102 checks passed
mbutrovich added a commit to mbutrovich/datafusion-comet that referenced this pull request Sep 29, 2025
mbutrovich added a commit that referenced this pull request Sep 29, 2025
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
* feat: do not fallback to Spark for distincts
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025