feat: Add percentile_cont aggregate function #17988

adriangb · 2025-10-09T03:57:22Z

Summary

Adds exact percentile_cont aggregate function as the counterpart to the existing approx_percentile_cont function.

What changes were made?

New Implementation

Created percentile_cont.rs with full implementation
PercentileCont struct implementing AggregateUDFImpl
PercentileContAccumulator for standard aggregation
DistinctPercentileContAccumulator for DISTINCT mode
PercentileContGroupsAccumulator for efficient grouped aggregation
calculate_percentile function with linear interpolation

Features

Exact calculation: Stores all values in memory for precise results
WITHIN GROUP syntax: Supports WITHIN GROUP (ORDER BY ...)
Interpolation: Uses linear interpolation between values
All numeric types: Works with integers, floats, and decimals
Ordered-set aggregate: Properly marked as is_ordered_set_aggregate()
GROUP BY support: Efficient grouped aggregation via GroupsAccumulator
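
As a rough illustration of the linear interpolation listed above, here is a minimal sketch over plain `f64` values. It is not the code in percentile_cont.rs: the name `percentile_cont_sketch` is hypothetical, and it assumes the rank is computed as `percentile * (n - 1)`, which is the usual definition for a continuous percentile.

```rust
/// Hypothetical helper (not the PR's `calculate_percentile`): exact
/// continuous percentile over `f64` values with linear interpolation.
fn percentile_cont_sketch(values: &mut [f64], percentile: f64) -> Option<f64> {
    // Empty input or an out-of-range (or NaN) percentile yields no result.
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    // Full sort for clarity; `total_cmp` gives a total order over floats.
    values.sort_by(|a, b| a.total_cmp(b));

    // Assumed rank definition: position in the sorted values, 0-based.
    let rank = percentile * (values.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let upper = rank.ceil() as usize;
    let fraction = rank - lower as f64;

    // Interpolate between the two neighbouring values.
    Some(values[lower] + (values[upper] - values[lower]) * fraction)
}
```

With four values 1, 2, 3, 4 and a percentile of 0.5, the rank is 1.5, so the result falls halfway between 2 and 3.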

Tests

Added comprehensive tests in aggregate.slt:

Error conditions validation
Basic percentile calculations (0.0, 0.25, 0.5, 0.75, 1.0)
Comparison with median function
Ascending and descending order
GROUP BY aggregation
NULL handling
Edge cases (empty sets, single values)
Float interpolation
Various numeric data types

Example Usage

-- Basic usage with WITHIN GROUP syntax
SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) 
FROM table_name;

-- With GROUP BY
SELECT category, percentile_cont(0.95) WITHIN GROUP (ORDER BY value)
FROM sales
GROUP BY category;

-- Compare with median (percentile_cont(0.5) == median)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) FROM products;

Performance Considerations

Like median, this function stores all values in memory before computing results. For large datasets or when approximation is acceptable, use approx_percentile_cont instead.

Related Issues

Closes #6714

🤖 Generated with Claude Code

Copilot

Pull Request Overview

This PR adds an exact percentile_cont aggregate function as a counterpart to the existing approximate version. The function implements SQL standard percentile continuous calculation with linear interpolation between values.

Implements percentile_cont aggregate function with exact calculation using all values in memory
Supports WITHIN GROUP syntax for ordered-set aggregates with ascending/descending order
Includes both regular and grouped accumulator implementations for efficient GROUP BY operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
datafusion/functions-aggregate/src/percentile_cont.rs	New module implementing the complete percentile_cont aggregate function with accumulators
datafusion/functions-aggregate/src/lib.rs	Module registration and export of the new percentile_cont function
datafusion/sqllogictest/test_files/aggregate.slt	Comprehensive test suite covering error conditions, basic functionality, and edge cases


datafusion/functions-aggregate/src/percentile_cont.rs

adriangb · 2025-10-09T04:06:07Z

@jonmmease would you be willing to review this PR? I mostly let Claude loose on it after some guidance, so be warned that it generated most of the code. I looked over the work and it actually seems surprisingly good to me, but I haven't implemented an aggregation function in DataFusion before and there are a lot of fiddly bits I'm just not sure about.

Copilot

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


datafusion/functions-aggregate/src/percentile_cont.rs

Jefffrey

cc @jonmmease and @2010YOUY01 I believe you two worked on implementations of this before (#6718 and #7337)

Jefffrey · 2025-10-09T14:47:47Z

datafusion/functions-aggregate/src/percentile_cont.rs

+/// result, but if cardinality is low then memory usage will also be lower.
+#[derive(PartialEq, Eq, Hash)]
+pub struct PercentileCont {
+    signature: Signature,


Should we consider adding a quantile_cont alias, which is the name DuckDB uses?

Jefffrey · 2025-10-09T14:55:19Z

datafusion/functions-aggregate/src/percentile_cont.rs

+fn get_scalar_value(expr: &Arc<dyn PhysicalExpr>) -> Result<ScalarValue> {
+    use arrow::array::RecordBatch;
+    use arrow::datatypes::Schema;
+    use datafusion_expr::ColumnarValue;
+
+    let empty_schema = Arc::new(Schema::empty());
+    let batch = RecordBatch::new_empty(Arc::clone(&empty_schema));
+    if let ColumnarValue::Scalar(s) = expr.evaluate(&batch)? {
+        Ok(s)
+    } else {
+        internal_err!("Didn't expect ColumnarValue::Array")
+    }
+}
+
+fn validate_percentile(expr: &Arc<dyn PhysicalExpr>) -> Result<f64> {
+    let percentile = match get_scalar_value(expr)
+        .map_err(|_| not_impl_datafusion_err!("Percentile value for 'PERCENTILE_CONT' must be a literal"))? {
+        ScalarValue::Float32(Some(value)) => {
+            value as f64
+        }


Would be nice to deduplicate this code with approx_percentile_cont; also I wonder if we could clean some of it up. get_scalar_value feels a bit hacky, I wonder if there is some other code that exists to do this for us already 🤔

Also the error message for percentile value needing to be a literal would probably be a plan error instead of NotImplemented as I'm not sure we ever plan to implement something like that (how would we pass an array instead of a scalar for that 🤔 )
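
One way to picture the suggestion is the sketch below, which reports both the non-literal case and an out-of-range value as plan-style errors. Everything in it is a stand-in: `SketchError` is a hypothetical local enum rather than DataFusion's error type, `Option<f64>` models "literal or not" instead of a `PhysicalExpr`, and the range check and its message are assumptions, not taken from the diff.

```rust
/// Hypothetical stand-in for a planner error; not DataFusion's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    Plan(String),
}

/// Hypothetical validator. `None` models a percentile argument that is not
/// a literal; `Some(v)` models a literal that evaluated to `v`.
fn validate_percentile_sketch(literal: Option<f64>) -> Result<f64, SketchError> {
    // A non-literal percentile is a planning mistake, not a missing feature.
    let value = literal.ok_or_else(|| {
        SketchError::Plan(
            "Percentile value for 'PERCENTILE_CONT' must be a literal".to_string(),
        )
    })?;

    // Assumed range check: percentiles are fractions in [0.0, 1.0].
    if !(0.0..=1.0).contains(&value) {
        return Err(SketchError::Plan(format!(
            "Percentile value must be between 0.0 and 1.0 inclusive, {value} is invalid"
        )));
    }
    Ok(value)
}
```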

Jefffrey · 2025-10-09T14:56:57Z

datafusion/functions-aggregate/src/percentile_cont.rs

+            );
+        }
+
+        let percentile = validate_percentile(&args.exprs[1])?;


A bit of the code here feels duplicated with create_accumulator(); not sure how much it matters though.

Jefffrey · 2025-10-09T14:58:37Z

datafusion/functions-aggregate/src/percentile_cont.rs

+        // Cast to target type if needed (e.g., integer to Float64)
+        let values = if values[0].data_type() != &self.data_type {
+            arrow::compute::cast(&values[0], &self.data_type)?
+        } else {
+            Arc::clone(&values[0])
+        };


Is this cast strictly necessary? Can we assume data coming into update_batch() has been coerced to expected type?

Jefffrey · 2025-10-09T15:03:36Z

datafusion/functions-aggregate/src/percentile_cont.rs

+    } else if percentile == 0.0 {
+        // Get minimum value
+        values.sort_by(cmp);
+        Some(values[0])
+    } else if percentile == 1.0 {
+        // Get maximum value
+        values.sort_by(cmp);
+        Some(values[len - 1])


For these 0.0 and 1.0 cases I wonder if we can simplify earlier to switch percentile_cont to min/max respectively? 🤔

(Otherwise I think we can just use select_nth_unstable_by() here like below instead of doing a full sort)
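
A minimal sketch of the selection-based alternative, assuming plain `f64` values and the `percentile * (n - 1)` rank definition. The function name is hypothetical and this is not the PR's code. `select_nth_unstable_by` from the standard library places the element at the requested index in its sorted position and partitions the rest around it, so the 0.0 and 1.0 cases need no special handling and no full sort.

```rust
/// Hypothetical helper: continuous percentile using selection, not sorting.
fn percentile_by_selection(values: &mut [f64], percentile: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    // Assumed rank definition: 0-based position in sorted order.
    let rank = percentile * (values.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let fraction = rank - lower as f64;

    // After this call, `values[lower]` holds the element that a full sort
    // would put there, and `greater` holds everything ordered after it.
    let (_, lower_value, greater) =
        values.select_nth_unstable_by(lower, |a, b| a.total_cmp(b));
    let lower_value = *lower_value;

    // Rank landed exactly on an element (covers 0.0 and 1.0): no neighbour needed.
    if fraction == 0.0 {
        return Some(lower_value);
    }

    // The upper neighbour is the smallest element of the right partition.
    let upper_value = greater.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    Some(lower_value + (upper_value - lower_value) * fraction)
}
```

Selection runs in linear time on average, whereas a full sort is O(n log n), which is the motivation for preferring it when only one or two order statistics are needed.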

Jefffrey · 2025-10-09T15:06:47Z

datafusion/sqllogictest/test_files/aggregate.slt

Perhaps we should add some tests without the WITHIN GROUP clause, as well as a test with percentile of 0.4 for ascending and 0.6 for descending on the same column to show they should give the same result
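
The reasoning behind the 0.4 ascending versus 0.6 descending test can be sketched as follows. With rank `p * (n - 1)`, reversing the sort order maps index `j` to `n - 1 - j`, so percentile `p` over ascending values and `1 - p` over descending values pick the same pair of neighbours. The helper names below are hypothetical, and the comparison uses a small tolerance because the two computations may differ in the last floating-point digit.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: interpolated percentile after sorting with `cmp`.
fn percentile_with_order(
    values: &[f64],
    percentile: f64,
    cmp: fn(&f64, &f64) -> Ordering,
) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&percentile) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(cmp);
    let rank = percentile * (sorted.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let upper = rank.ceil() as usize;
    let fraction = rank - lower as f64;
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}

fn ascending(a: &f64, b: &f64) -> Ordering {
    a.total_cmp(b)
}

fn descending(a: &f64, b: &f64) -> Ordering {
    b.total_cmp(a)
}

/// Checks that `p` ascending matches `1 - p` descending within a tolerance.
fn orderings_agree(values: &[f64], percentile: f64) -> bool {
    let asc = percentile_with_order(values, percentile, ascending);
    let desc = percentile_with_order(values, 1.0 - percentile, descending);
    match (asc, desc) {
        (Some(a), Some(b)) => (a - b).abs() < 1e-9,
        (None, None) => true,
        _ => false,
    }
}
```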

Adds exact percentile_cont aggregate function as the counterpart to the existing approx_percentile_cont function.

This implementation:
- Calculates exact percentiles by storing all values in memory
- Supports WITHIN GROUP (ORDER BY ...) syntax
- Uses linear interpolation between values for continuous percentiles
- Handles all numeric types (integers, floats, decimals)
- Supports DISTINCT mode
- Includes GroupsAccumulator for efficient grouped aggregation
- Comprehensive test coverage

Closes apache#6714

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Add INTERPOLATION_PRECISION constant for magic number 1000000
- Fix size calculation to use size_of::<Vec<T::Native>>() instead of size_of::<Vec<T>>()
- Add additional test cases for integer interpolation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Improve INTERPOLATION_PRECISION documentation with detailed explanation
- Add comprehensive safety comment for wrapping arithmetic usage
- Expand unsafe code justification for OffsetBuffer::new_unchecked
- Fix error message to avoid displaying trait object

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

adriangb · 2025-10-10T20:25:37Z

Thank you for your review @Jefffrey. I think I've addressed the feedback, let me know if there's anything else you can spot.

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Oct 9, 2025

adriangb force-pushed the percentile-cont branch from 343ab84 to a7383c6 Compare October 9, 2025 03:57

adriangb requested a review from Copilot October 9, 2025 04:03

Copilot AI reviewed Oct 9, 2025

adriangb requested a review from Copilot October 9, 2025 04:40

Copilot AI reviewed Oct 9, 2025

github-actions bot added the documentation Improvements or additions to documentation label Oct 9, 2025

Jefffrey reviewed Oct 9, 2025

adriangb and others added 9 commits October 10, 2025 14:19

fixes

303c4dd

always return floats

a029e98

some refactoring

4d27fbf

update

a11aa37

wip on review feedback

24469a7

move to private utils

ce6051f

adriangb force-pushed the percentile-cont branch from a598e7b to ce6051f Compare October 10, 2025 19:45

adriangb added 2 commits October 10, 2025 15:02

add missing file

dea507d

add tests

6e9db91

update docs

2873155
