-
Notifications
You must be signed in to change notification settings - Fork 267
feat: do not fallback to Spark for COUNT(distinct)
#2429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Might be fixed by #2407 Running |
|
I think it was fixed by several issues. I think #2407 is one, but I think the native shuffle rewrite and bumping the Arrow Java version contributed as well. It's encouraging to see! |
|
I'm adding more tests to make sure it is working now |
| binding: Boolean, | ||
| conf: SQLConf): Option[AggExpr] = { | ||
|
|
||
| if (aggExpr.isDistinct) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to pass the aggExpr.isDistinct value into the protobuf plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good point, I was thinking the same but IMO Spark doesn't call the count distinct on partial phase.
+-----------------------------+
| Driver |
| COUNT(DISTINCT name) |
+-------------+---------------+
|
v
+-------------------+ +-------------------+
| Executor 1 | | Executor 2 |
| Partitions P0,P1 | | Partitions P2,P3 |
| Local distinct: | | Local distinct: |
| {Alice,Bob,Eve} | | {Mallory,Eve,Bob,Trent}|
+---------+---------+ +---------+---------+
| |
| Shuffle |
v v
+--------------+ +--------------+
| Reducer R0 | | Reducer R1 |
| {Alice,Bob,Eve} | {Mallory,Trent} |
+--------------+ +--------------+
\ /
\ /
\ /
+-----------+-------------+
v
Driver Final Merge
DISTINCT = 5
The local distinct is made by HashAggregate so when count distinct get called as aggExpr it might not be needing the flag as data already deduped on reducers. Checking the Final stage though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining that. We should add tests for other distinct aggregates as well, such as sum and avg. I'm not sure if there are others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark has tests for the following aggregates with DISTINCT:
- count
- sum
- avg
- first
- last
- corr
- var_pop
- var_samp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, we could just remove the fallback for COUNT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, will do once
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2429 +/- ##
============================================
+ Coverage 56.12% 58.48% +2.35%
- Complexity 976 1440 +464
============================================
Files 119 146 +27
Lines 11743 13519 +1776
Branches 2251 2352 +101
============================================
+ Hits 6591 7906 +1315
- Misses 4012 4379 +367
- Partials 1140 1234 +94 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
depends on #2258 |
| val df = spark.read.parquet(filename) | ||
| df.createOrReplaceTempView("t1") | ||
| for (col <- df.columns) { | ||
| val sql = s"SELECT count(distinct $col) FROM t1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you also add tests for count distinct with multiple columns e.g. COUNT(DISTINCT col1, col2, col3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove I'll add them separately as well
|
Fuzz tests fall to Spark checking |
COUNT(distinct)
|
Can we add a case to CometAggregateBenchmark to tout our gains? :) |
Added |
mbutrovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I guess Comet already knew how to serde everything and it was just a matter of removing the fallback and beefing up tests. Thanks @comphead!
I also ran the benchmark locally: ~1.4x on most cases! 🚀
|
I found that we fall back to Spark when there are multiple count distinct e.g. This isn't an issue for this PR, but thought I should make a note of this. |
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @comphead!
Filed #2456 |
* feat: do not fallback to Spark for distincts
Which issue does this PR close?
Just experiment, related to #2292
Closes #.
Rationale for this change
What changes are included in this PR?
How are these changes tested?