-
Notifications
You must be signed in to change notification settings - Fork 268
perf: Use DataFusion's count_udaf instead of SUM(IF(expr IS NOT NULL, 1, 0))
#2407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2407 +/- ##
============================================
+ Coverage 56.12% 57.52% +1.39%
- Complexity 976 1296 +320
============================================
Files 119 147 +28
Lines 11743 13469 +1726
Branches 2251 2352 +101
============================================
+ Hits 6591 7748 +1157
- Misses 4012 4457 +445
- Partials 1140 1264 +124 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I ran TPC-H locally and saw no regression in performance |
| benchmark.addCase(s"SQL Parquet - Comet (Scan) ($aggregateFunction)") { _ => | ||
| withSQLConf(CometConf.COMET_ENABLED.key -> "true") { | ||
| spark.sql(query).noop() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It no longer makes sense to benchmark Comet with Scan vs Scan & Exec configurations, especially for aggregates. Comet scans won't accelerate aggregates.
mbutrovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know how I love a good negative line count. Thanks for following up on my comment in #2397. Glad to see we can rely on DF for this.
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove it is nice removing unnecessary evaluations
|
Merged. Thanks for the reviews @mbutrovich and @comphead |
Which issue does this PR close?
Rationale for this change
Use DataFusion's
count_udafinstead of translatingcount(expr)tosum(if(expr is not null, 1, 0)).When we tried to use
count_udaf(about a year ago - see #744) we ran into performance issues, which is why we added thesumworkaround, but this is no longer needed.What changes are included in this PR?
How are these changes tested?
I ran CometAggregateSuite: