feat: Support ANSI mode sum expr #2600
Draft PR to support sum function - WIP
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2600 +/- ##
=============================================
- Coverage 56.12% 45.96% -10.17%
- Complexity 976 1201 +225
=============================================
Files 119 147 +28
Lines 11743 13811 +2068
Branches 2251 2370 +119
=============================================
- Hits 6591 6348 -243
- Misses 4012 6422 +2410
+ Partials 1140 1041 -99
| for i in 0..int_array.len() { | ||
| if !int_array.is_null(i) { | ||
| let v = int_array.value(i).to_i64().ok_or_else(|| { | ||
| DataFusionError::Internal("Failed to convert value to i64".to_string()) |
It would be helpful to print the problematic value in the error message.
Thank you for the review. I believe we won't need these checks at all, given that the data types are already checked multiple times (in the planning phase and in internal functions).
| running_sum, | ||
| )?, | ||
| _ => { | ||
| panic!("Unsupported data type {}", values.data_type()) |
At line 138 it returns an Err when the conversion fails. Would it make sense to return an Err here too, instead of panicking?
Thank you. I made sure it is all error-driven and no longer panics.
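The change discussed in this thread (returning an error for an unsupported input type rather than panicking) could look roughly like the sketch below. It is illustrative only: `SumError`, `InputType` and `check_supported` are hypothetical stand-ins so the example compiles without external crates, whereas the PR itself works with DataFusion's error and Arrow's data types.

```rust
/// Hypothetical error type standing in for the error the PR actually returns.
#[derive(Debug, PartialEq)]
enum SumError {
    UnsupportedType(String),
}

/// Hypothetical, simplified stand-in for the Arrow data type of the input array.
#[derive(Debug, Clone, Copy)]
enum InputType {
    Int8,
    Int16,
    Int32,
    Int64,
    Utf8,
}

/// Returns `Ok(())` for the integer types handled by the sum, and an `Err`
/// naming the offending type otherwise, instead of calling `panic!`.
fn check_supported(input: InputType) -> Result<(), SumError> {
    match input {
        InputType::Int8 | InputType::Int16 | InputType::Int32 | InputType::Int64 => Ok(()),
        other => Err(SumError::UnsupportedType(format!(
            "Unsupported data type {:?}",
            other
        ))),
    }
}
```

Because the failure is an ordinary `Result`, the caller can propagate it with `?`, and the message still names the type that was rejected.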
970d693 to c0715aa
@andygrove, @comphead I believe this should be ready for review (pending CI)
| } | ||
| fn size(&self) -> usize { | ||
| std::mem::size_of_val(self) |
What is the purpose of this method?
This returns the size of the pointer/reference &self.
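For context on this question: `std::mem::size_of_val` takes a reference and reports the size of the value behind it, so calling it with `self` gives the shallow size of the struct. It does not follow pointers, so heap memory owned by fields is not counted. The sketch below uses a hypothetical `SumState` struct (not the PR's type), with a `Vec` field added purely to show the difference between shallow and deep size.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical accumulator state; the `buffer` field exists only to
/// illustrate heap memory that `size_of_val` does not count.
#[allow(dead_code)]
struct SumState {
    sum: Option<i64>,
    has_all_nulls: bool,
    buffer: Vec<i64>,
}

impl SumState {
    /// Shallow size: the struct itself, equal to `size_of::<SumState>()`.
    fn shallow_size(&self) -> usize {
        size_of_val(self)
    }

    /// Shallow size plus the heap allocation owned by `buffer`.
    fn deep_size(&self) -> usize {
        size_of_val(self) + self.buffer.capacity() * size_of::<i64>()
    }
}
```

For a state made only of inline fields such as `Option<i64>` and `bool`, the shallow size is the full size, which is presumably why the method can be this short.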
| "integer", | ||
| ))) | ||
| } else { | ||
| return Ok(None); |
| return Ok(None); | |
| Ok(None) |
| fn new(eval_mode: EvalMode) -> Self { | ||
| if eval_mode == EvalMode::Try { | ||
| Self { | ||
| // Try mode starts with 0 (because if this is init to None we cant say if it is none due to all nulls or due to an overflow |
| // Try mode starts with 0 (because if this is init to None we cant say if it is none due to all nulls or due to an overflow | |
| // Try mode starts with 0 (because if this is init to None we cant say if it is none due to all nulls or due to an overflow) |
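The reasoning in that code comment can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the PR's implementation: `TrySumState` and its methods are hypothetical names, while `has_all_nulls` and the `Option<i64>` running sum follow the PR description. The sum starts at `Some(0)`, so a `None` sum can only mean overflow, and the separate flag records whether any non-null input was seen.

```rust
/// Hypothetical Try-mode state: `sum == None` means "overflowed",
/// `has_all_nulls == true` means "no non-null input seen yet".
struct TrySumState {
    sum: Option<i64>,
    has_all_nulls: bool,
}

impl TrySumState {
    fn new() -> Self {
        // Start at Some(0) so that None is reserved for overflow.
        Self { sum: Some(0), has_all_nulls: true }
    }

    /// Feeds one input value; `None` stands for a SQL NULL and is skipped.
    fn update(&mut self, value: Option<i64>) {
        if let Some(v) = value {
            self.has_all_nulls = false;
            // checked_add yields None on overflow; once None, the sum stays None.
            self.sum = self.sum.and_then(|s| s.checked_add(v));
        }
    }

    /// Final result: NULL for all-null input or for overflow, else the sum.
    fn evaluate(&self) -> Option<i64> {
        if self.has_all_nulls {
            None
        } else {
            self.sum
        }
    }
}
```

Both the all-null case and the overflow case evaluate to NULL, but the state still distinguishes them, which matters when partial states are merged.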
| (null.asInstanceOf[java.lang.Long], "b"), | ||
| (null.asInstanceOf[java.lang.Long], "b")), | ||
| "tbl") { | ||
| val res = sql("SELECT _2, sum(_1) FROM tbl group by 1") |
| val res = sql("SELECT _2, sum(_1) FROM tbl group by 1") | |
| val res = sql("SELECT _2, try_sum(_1) FROM tbl group by 1") |
The name of the test says try_sum
| case s: Sum => | ||
| if (AggSerde.sumDataTypeSupported(s.dataType) && !s.dataType | ||
| .isInstanceOf[DecimalType]) { | ||
| .isInstanceOf[DecimalType] && !integerTypes.contains(s.dataType)) { |
Why are the integer types excluded now?
That's actually interesting: why would the window fall back? We are planning to fall back on windows in #2726, but the test would still preserve windows, so I think we need to check the fallback reason.
Which issue does this PR close?
Closes #531 (partially). The ANSI changes to support AVG will be tracked in a new PR.
Rationale for this change
DataFusion's default SUM doesn't match Spark's overflow semantics. This implementation ensures we get exactly the same behavior as Spark for integer overflow in all three eval modes (ANSI, Legacy and Try).
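As a rough illustration of the three behaviors, the sketch below adds one value to a running `i64` sum under each mode. It assumes Spark's usual semantics (Legacy wraps around on overflow, ANSI raises an error, Try yields NULL). The function `add_with_mode`, the `String` error and the `Legacy`/`Ansi` variant names are assumptions made for this example, not the PR's API.

```rust
/// Eval modes named in the PR description (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

/// Adds `value` to `sum` under the given mode.
/// `Ok(None)` stands for SQL NULL; `Err` stands for the ANSI overflow error.
fn add_with_mode(sum: i64, value: i64, mode: EvalMode) -> Result<Option<i64>, String> {
    match mode {
        // Legacy: overflow silently wraps around.
        EvalMode::Legacy => Ok(Some(sum.wrapping_add(value))),
        // ANSI: overflow is reported as an error.
        EvalMode::Ansi => sum
            .checked_add(value)
            .map(Some)
            .ok_or_else(|| format!("integer overflow adding {} to {}", value, sum)),
        // Try: overflow produces NULL instead of an error.
        EvalMode::Try => Ok(sum.checked_add(value)),
    }
}
```

For example, adding 1 to `i64::MAX` wraps to `i64::MIN` in Legacy mode, returns an `Err` in ANSI mode, and returns `Ok(None)` in Try mode.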
What changes are included in this PR?
This PR adds a native Rust implementation of SUM aggregation for integer types (byte, short, int, long).
Native changes (Rust):
(Inspired by SumDecimal) SumInteger aggregate function that handles SUM for all integer types (consistent with Spark)
(Implemented in a similar fashion to Spark, leveraging Option<i64> to represent NULL and numeric values for the sum, and an additional parameter called has_all_nulls, which is used in Try mode to distinguish whether a NULL sum is caused by all-NULL inputs or by the sum overflowing. Spark does this with shouldTrackIsEmpty and by assigning NULL to the running sum, which is a long datatype.)
Scala side changes:
CometSum to add ANSI support (ANSI and Try)
eval_mode instead of fail_on_error to support Legacy, ANSI and Try eval modes
java.lang.Long to avoid Scala's auto-boxing feature, which auto-casts objects to primitive types, thereby casting nulls to 0s) handling in both simple and GROUP BY agg.
How are these changes tested?