-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat: Add percentile_cont aggregate function #17988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
343ab84
to
a7383c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds an exact percentile_cont
aggregate function as a counterpart to the existing approximate version. The function implements SQL standard percentile continuous calculation with linear interpolation between values.
- Implements
percentile_cont
aggregate function with exact calculation using all values in memory - Supports WITHIN GROUP syntax for ordered-set aggregates with ascending/descending order
- Includes both regular and grouped accumulator implementations for efficient GROUP BY operations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
datafusion/functions-aggregate/src/percentile_cont.rs | New module implementing the complete percentile_cont aggregate function with accumulators |
datafusion/functions-aggregate/src/lib.rs | Module registration and export of the new percentile_cont function |
datafusion/sqllogictest/test_files/aggregate.slt | Comprehensive test suite covering error conditions, basic functionality, and edge cases |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
@jonmmease would you be willing to review this PR? I mostly let Claude loose on it after some guidance, be warned that it generated most of the code. I looked over the work and it actually seems surprisingly good to me but I also haven't implemented an aggregation function in DataFusion before and there's a lot of fidly bits I'm just not sure about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jonmmease and @2010YOUY01 I believe you two worked on implementations of this before (#6718 and #7337)
/// result, but if cardinality is low then memory usage will also be lower. | ||
#[derive(PartialEq, Eq, Hash)] | ||
pub struct PercentileCont { | ||
signature: Signature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider adding a quantile_cont
alias, which is the name DuckDB uses?
fn get_scalar_value(expr: &Arc<dyn PhysicalExpr>) -> Result<ScalarValue> { | ||
use arrow::array::RecordBatch; | ||
use arrow::datatypes::Schema; | ||
use datafusion_expr::ColumnarValue; | ||
|
||
let empty_schema = Arc::new(Schema::empty()); | ||
let batch = RecordBatch::new_empty(Arc::clone(&empty_schema)); | ||
if let ColumnarValue::Scalar(s) = expr.evaluate(&batch)? { | ||
Ok(s) | ||
} else { | ||
internal_err!("Didn't expect ColumnarValue::Array") | ||
} | ||
} | ||
|
||
fn validate_percentile(expr: &Arc<dyn PhysicalExpr>) -> Result<f64> { | ||
let percentile = match get_scalar_value(expr) | ||
.map_err(|_| not_impl_datafusion_err!("Percentile value for 'PERCENTILE_CONT' must be a literal"))? { | ||
ScalarValue::Float32(Some(value)) => { | ||
value as f64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to deduplicate this code with approx_percentile_cont; also I wonder if we could clean some of it up. get_scalar_value
feels a bit hacky, I wonder if there is some other code that exists to do this for us already 🤔
Also the error message for percentile value needing to be a literal would probably be a plan error instead of NotImplemented as I'm not sure we ever plan to implement something like that (how would we pass an array instead of a scalar for that 🤔 )
); | ||
} | ||
|
||
let percentile = validate_percentile(&args.exprs[1])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of the code here feels duplicated with create_accumulator()
, not sure how much it matters though
// Cast to target type if needed (e.g., integer to Float64) | ||
let values = if values[0].data_type() != &self.data_type { | ||
arrow::compute::cast(&values[0], &self.data_type)? | ||
} else { | ||
Arc::clone(&values[0]) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast strictly necessary? Can we assume data coming into update_batch()
has been coerced to expected type?
} else if percentile == 0.0 { | ||
// Get minimum value | ||
values.sort_by(cmp); | ||
Some(values[0]) | ||
} else if percentile == 1.0 { | ||
// Get maximum value | ||
values.sort_by(cmp); | ||
Some(values[len - 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these 0.0
and 1.0
cases I wonder if we can simplify earlier to switch percentile_cont
to min
/max
respectively? 🤔
(Otherwise I think can just use select_nth_unstable_by()
here like below instead of doing a full sort)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add some tests without the WITHIN GROUP
clause, as well as a test with percentile of 0.4
for ascending and 0.6
for descending on the same column to show they should give the same result
Adds exact percentile_cont aggregate function as the counterpart to the existing approx_percentile_cont function. This implementation: - Calculates exact percentiles by storing all values in memory - Supports WITHIN GROUP (ORDER BY ...) syntax - Uses linear interpolation between values for continuous percentiles - Handles all numeric types (integers, floats, decimals) - Supports DISTINCT mode - Includes GroupsAccumulator for efficient grouped aggregation - Comprehensive test coverage Closes apache#6714 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Add INTERPOLATION_PRECISION constant for magic number 1000000 - Fix size calculation to use size_of::<Vec<T::Native>>() instead of size_of::<Vec<T>>() - Add additional test cases for integer interpolation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Improve INTERPOLATION_PRECISION documentation with detailed explanation - Add comprehensive safety comment for wrapping arithmetic usage - Expand unsafe code justification for OffsetBuffer::new_unchecked - Fix error message to avoid displaying trait object 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
a598e7b
to
ce6051f
Compare
Thank you for your review @Jefffrey. I think I've addressed the feedback, let me know if there's anything else you can spot. |
Summary
Adds exact
percentile_cont
aggregate function as the counterpart to the existingapprox_percentile_cont
function.What changes were made?
New Implementation
percentile_cont.rs
with full implementationPercentileCont
struct implementingAggregateUDFImpl
PercentileContAccumulator
for standard aggregationDistinctPercentileContAccumulator
for DISTINCT modePercentileContGroupsAccumulator
for efficient grouped aggregationcalculate_percentile
function with linear interpolationFeatures
WITHIN GROUP (ORDER BY ...)
is_ordered_set_aggregate()
Tests
Added comprehensive tests in
aggregate.slt
:median
functionExample Usage
Performance Considerations
Like
median
, this function stores all values in memory before computing results. For large datasets or when approximation is acceptable, useapprox_percentile_cont
instead.Related Issues
Closes #6714
🤖 Generated with Claude Code