@adriangb adriangb commented Oct 9, 2025

Summary

Adds exact percentile_cont aggregate function as the counterpart to the existing approx_percentile_cont function.

What changes were made?

New Implementation

  • Created percentile_cont.rs with full implementation
  • PercentileCont struct implementing AggregateUDFImpl
  • PercentileContAccumulator for standard aggregation
  • DistinctPercentileContAccumulator for DISTINCT mode
  • PercentileContGroupsAccumulator for efficient grouped aggregation
  • calculate_percentile function with linear interpolation
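
As a rough illustration of what a DISTINCT accumulator has to do, the sketch below deduplicates inputs before interpolating. It is not the PR's DistinctPercentileContAccumulator: the `DistinctPercentile` type, its methods, and the restriction to `i64` inputs are all assumptions made to keep the example standalone.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified DISTINCT accumulator (not DataFusion's API).
/// A BTreeSet keeps the values unique and already sorted.
struct DistinctPercentile {
    percentile: f64,
    seen: BTreeSet<i64>,
}

impl DistinctPercentile {
    fn new(percentile: f64) -> Self {
        Self { percentile, seen: BTreeSet::new() }
    }

    /// Record a batch; duplicates are dropped by the set.
    fn update_batch(&mut self, batch: &[i64]) {
        self.seen.extend(batch.iter().copied());
    }

    /// Linear interpolation over the distinct values; None if no input was seen.
    fn evaluate(&self) -> Option<f64> {
        let sorted: Vec<f64> = self.seen.iter().map(|v| *v as f64).collect();
        if sorted.is_empty() {
            return None;
        }
        let rank = self.percentile * (sorted.len() - 1) as f64;
        let lower = rank.floor() as usize;
        let upper = rank.ceil() as usize;
        let fraction = rank - lower as f64;
        Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
    }
}
```

With inputs 1, 1, 1, 2, 3, 3 and a percentile of 0.5, the distinct values are 1, 2, 3, so the result is 2.0 rather than the 1.5 that the non-distinct calculation would give.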

Features

  • Exact calculation: Stores all values in memory for precise results
  • WITHIN GROUP syntax: Supports WITHIN GROUP (ORDER BY ...)
  • Interpolation: Uses linear interpolation between values
  • All numeric types: Works with integers, floats, and decimals
  • Ordered-set aggregate: Properly marked as is_ordered_set_aggregate()
  • GROUP BY support: Efficient grouped aggregation via GroupsAccumulator
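
For readers unfamiliar with continuous percentiles, here is a minimal sketch of linear interpolation between the two nearest ranks. It assumes plain `f64` inputs and a hypothetical free function named `percentile_cont`. The PR's calculate_percentile is generic over Arrow numeric types and may differ in detail.

```rust
/// Hypothetical helper: exact continuous percentile over f64 values.
/// Returns None for empty input or a percentile outside [0, 1].
fn percentile_cont(values: &[f64], percentile: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    // Exact calculation needs every value, sorted.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional rank in [0, n - 1].
    let rank = percentile * (sorted.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let upper = rank.ceil() as usize;
    let fraction = rank - lower as f64;

    // Interpolate between the two neighbouring values.
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}
```

For the values 1, 2, 3, 4 and a percentile of 0.5, the fractional rank is 1.5, so the result lies halfway between 2 and 3.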

Tests

Added comprehensive tests in aggregate.slt:

  • Error conditions validation
  • Basic percentile calculations (0.0, 0.25, 0.5, 0.75, 1.0)
  • Comparison with median function
  • Ascending and descending order
  • GROUP BY aggregation
  • NULL handling
  • Edge cases (empty sets, single values)
  • Float interpolation
  • Various numeric data types

Example Usage

-- Basic usage with WITHIN GROUP syntax
SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) 
FROM table_name;

-- With GROUP BY
SELECT category, percentile_cont(0.95) WITHIN GROUP (ORDER BY value)
FROM sales
GROUP BY category;

-- Compare with median (percentile_cont(0.5) == median)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) FROM products;

Performance Considerations

Like median, this function stores all values in memory before computing results. For large datasets or when approximation is acceptable, use approx_percentile_cont instead.
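
The memory cost can be seen in a simplified accumulator sketch. The `ExactPercentileAccumulator` type and its method names below are hypothetical and only loosely mirror the shape of an accumulator. They are not DataFusion's Accumulator trait, and NULLs are modelled as `None`.

```rust
/// Hypothetical, simplified accumulator (not DataFusion's Accumulator trait).
struct ExactPercentileAccumulator {
    percentile: f64,
    /// Every non-null input is retained until evaluate() is called.
    values: Vec<f64>,
}

impl ExactPercentileAccumulator {
    fn new(percentile: f64) -> Self {
        Self { percentile, values: Vec::new() }
    }

    /// Buffer a batch, skipping NULLs.
    fn update_batch(&mut self, batch: &[Option<f64>]) {
        self.values.extend(batch.iter().flatten().copied());
    }

    /// Merge partial state from another accumulator.
    fn merge(&mut self, other: Self) {
        self.values.extend(other.values);
    }

    /// Rough footprint in bytes: grows linearly with the number of buffered values.
    fn size(&self) -> usize {
        std::mem::size_of::<Self>() + self.values.capacity() * std::mem::size_of::<f64>()
    }

    /// Sort the buffered values and interpolate; None if nothing was buffered.
    fn evaluate(&mut self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        self.values.sort_by(|a, b| a.total_cmp(b));
        let rank = self.percentile * (self.values.len() - 1) as f64;
        let lower = rank.floor() as usize;
        let upper = rank.ceil() as usize;
        let fraction = rank - lower as f64;
        Some(self.values[lower] + (self.values[upper] - self.values[lower]) * fraction)
    }
}
```

Because nothing can be discarded before the final sort, `size()` scales with the input row count, which is the trade-off against approx_percentile_cont.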

Related Issues

Closes #6714

🤖 Generated with Claude Code

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Oct 9, 2025
@adriangb adriangb requested a review from Copilot October 9, 2025 04:03

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds an exact percentile_cont aggregate function as a counterpart to the existing approximate version. The function implements SQL standard percentile continuous calculation with linear interpolation between values.

  • Implements percentile_cont aggregate function with exact calculation using all values in memory
  • Supports WITHIN GROUP syntax for ordered-set aggregates with ascending/descending order
  • Includes both regular and grouped accumulator implementations for efficient GROUP BY operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
datafusion/functions-aggregate/src/percentile_cont.rs | New module implementing the complete percentile_cont aggregate function with accumulators
datafusion/functions-aggregate/src/lib.rs | Module registration and export of the new percentile_cont function
datafusion/sqllogictest/test_files/aggregate.slt | Comprehensive test suite covering error conditions, basic functionality, and edge cases

@adriangb
adriangb commented Oct 9, 2025

@jonmmease would you be willing to review this PR? I mostly let Claude loose on it after some guidance, so be warned that it generated most of the code. I looked over the work and it actually seems surprisingly good to me, but I haven't implemented an aggregation function in DataFusion before and there are a lot of fiddly bits I'm just not sure about.

@adriangb adriangb requested a review from Copilot October 9, 2025 04:40

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 9, 2025

@Jefffrey Jefffrey left a comment

cc @jonmmease and @2010YOUY01 I believe you two worked on implementations of this before (#6718 and #7337)

/// result, but if cardinality is low then memory usage will also be lower.
#[derive(PartialEq, Eq, Hash)]
pub struct PercentileCont {
    signature: Signature,

Should we consider adding a quantile_cont alias, which is the name DuckDB uses?

Comment on lines 210 to 229
fn get_scalar_value(expr: &Arc<dyn PhysicalExpr>) -> Result<ScalarValue> {
    use arrow::array::RecordBatch;
    use arrow::datatypes::Schema;
    use datafusion_expr::ColumnarValue;

    let empty_schema = Arc::new(Schema::empty());
    let batch = RecordBatch::new_empty(Arc::clone(&empty_schema));
    if let ColumnarValue::Scalar(s) = expr.evaluate(&batch)? {
        Ok(s)
    } else {
        internal_err!("Didn't expect ColumnarValue::Array")
    }
}

fn validate_percentile(expr: &Arc<dyn PhysicalExpr>) -> Result<f64> {
    let percentile = match get_scalar_value(expr)
        .map_err(|_| not_impl_datafusion_err!("Percentile value for 'PERCENTILE_CONT' must be a literal"))? {
        ScalarValue::Float32(Some(value)) => {
            value as f64
        }

Would be nice to deduplicate this code with approx_percentile_cont; also I wonder if we could clean some of it up. get_scalar_value feels a bit hacky, I wonder if there is some other code that exists to do this for us already 🤔

Also, the error for the percentile value needing to be a literal should probably be a plan error instead of NotImplemented, as I'm not sure we ever plan to implement something like that (how would we pass an array instead of a scalar for that 🤔)
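
To make the suggestion concrete, here is a standalone sketch of percentile validation that reports a planning-style error rather than a not-implemented one. `Literal` and `PercentileError` are hypothetical stand-ins invented for this example. They are not DataFusion's ScalarValue or DataFusionError, and the accepted types and range check are assumptions.

```rust
/// Hypothetical stand-in for a literal argument (not DataFusion's ScalarValue).
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Float32(Option<f32>),
    Float64(Option<f64>),
    Utf8(String),
}

/// Hypothetical error type with a single planning-error variant.
#[derive(Debug, PartialEq)]
enum PercentileError {
    Plan(String),
}

/// Accept only non-null float literals in the closed range [0, 1].
fn validate_percentile(lit: &Literal) -> Result<f64, PercentileError> {
    let value = match lit {
        Literal::Float32(Some(v)) => f64::from(*v),
        Literal::Float64(Some(v)) => *v,
        other => {
            return Err(PercentileError::Plan(format!(
                "Percentile value must be a non-null float literal, got {other:?}"
            )))
        }
    };
    // NaN fails the range check too, because it is not contained in any range.
    if !(0.0..=1.0).contains(&value) {
        return Err(PercentileError::Plan(format!(
            "Percentile value must be between 0.0 and 1.0 inclusive, got {value}"
        )));
    }
    Ok(value)
}
```

Sharing one helper like this between the exact and approximate functions would also address the duplication raised above, though where such a helper should live is left open here.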

);
}

let percentile = validate_percentile(&args.exprs[1])?;

A bit of the code here feels duplicated with create_accumulator(); not sure how much it matters, though.

Comment on lines +455 to +427
// Cast to target type if needed (e.g., integer to Float64)
let values = if values[0].data_type() != &self.data_type {
    arrow::compute::cast(&values[0], &self.data_type)?
} else {
    Arc::clone(&values[0])
};

Is this cast strictly necessary? Can we assume data coming into update_batch() has been coerced to expected type?

Comment on lines +793 to +767
} else if percentile == 0.0 {
    // Get minimum value
    values.sort_by(cmp);
    Some(values[0])
} else if percentile == 1.0 {
    // Get maximum value
    values.sort_by(cmp);
    Some(values[len - 1])

For these 0.0 and 1.0 cases I wonder if we can simplify earlier to switch percentile_cont to min/max respectively? 🤔

(Otherwise I think we can just use select_nth_unstable_by() here like below instead of doing a full sort)
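
A sketch of the selection-based alternative, assuming `f64` values and a hypothetical function named `percentile_select`: the standard library's `select_nth_unstable_by` places the element at the requested rank in its sorted position in O(n) on average, so the 0.0 and 1.0 cases need no full sort. This is not the PR's code, only an illustration of the idea.

```rust
/// Hypothetical helper: continuous percentile via partial selection.
/// Reorders `values` in place; returns None for empty input or a
/// percentile outside [0, 1].
fn percentile_select(values: &mut [f64], percentile: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    let rank = percentile * (values.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let fraction = rank - lower as f64;

    // Put the element of rank `lower` in place; everything in `right` is >= it.
    let (_, lower_value, right) = values.select_nth_unstable_by(lower, f64::total_cmp);
    let lower_value = *lower_value;
    if fraction == 0.0 {
        // Covers percentile 0.0 (minimum) and 1.0 (maximum) without sorting.
        return Some(lower_value);
    }

    // The next rank is the smallest element of the right-hand partition.
    let upper_value = right.iter().copied().min_by(f64::total_cmp)?;
    Some(lower_value + (upper_value - lower_value) * fraction)
}
```

Rewriting percentile 0.0 and 1.0 to min and max at planning time, as suggested above, would avoid buffering the values at all. The sketch only covers the cheaper in-accumulator option.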


Perhaps we should add some tests without the WITHIN GROUP clause, as well as a test with percentile of 0.4 for ascending and 0.6 for descending on the same column to show they should give the same result
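
The symmetry behind the suggested test can be checked with a small standalone sketch. `percentile_cont_ordered` and `approx_eq` are hypothetical helpers over `f64` values. The actual tests would be written as sqllogictest cases against the SQL function.

```rust
/// Hypothetical helper: continuous percentile over values sorted in the
/// requested direction, mirroring ORDER BY ... ASC / DESC.
fn percentile_cont_ordered(values: &[f64], percentile: f64, descending: bool) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    let mut sorted = values.to_vec();
    if descending {
        sorted.sort_by(|a, b| b.total_cmp(a));
    } else {
        sorted.sort_by(|a, b| a.total_cmp(b));
    }
    let rank = percentile * (sorted.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let upper = rank.ceil() as usize;
    let fraction = rank - lower as f64;
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}

/// Compare with a tolerance, since the two directions round differently.
fn approx_eq(a: f64, b: f64) -> bool {
    (a - b).abs() < 1e-9
}
```

For the values 1 through 5, percentile 0.4 ascending and percentile 0.6 descending both land about 2.6, equal up to floating-point rounding, which is why a tolerance is used instead of exact equality.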

adriangb and others added 9 commits October 10, 2025 14:19
Adds exact percentile_cont aggregate function as the counterpart
to the existing approx_percentile_cont function.

This implementation:
- Calculates exact percentiles by storing all values in memory
- Supports WITHIN GROUP (ORDER BY ...) syntax
- Uses linear interpolation between values for continuous percentiles
- Handles all numeric types (integers, floats, decimals)
- Supports DISTINCT mode
- Includes GroupsAccumulator for efficient grouped aggregation
- Comprehensive test coverage

Closes apache#6714

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Add INTERPOLATION_PRECISION constant for magic number 1000000
- Fix size calculation to use size_of::<Vec<T::Native>>() instead of size_of::<Vec<T>>()
- Add additional test cases for integer interpolation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Improve INTERPOLATION_PRECISION documentation with detailed explanation
- Add comprehensive safety comment for wrapping arithmetic usage
- Expand unsafe code justification for OffsetBuffer::new_unchecked
- Fix error message to avoid displaying trait object

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@adriangb
Thank you for your review @Jefffrey. I think I've addressed the feedback, let me know if there's anything else you can spot.

Labels

documentation: Improvements or additions to documentation
functions: Changes to functions implementation
sqllogictest: SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add percentile_cont aggregation function

2 participants