@GaneshPatil7517

Which issue does this PR close?

Closes #19511
Related to #18882

Rationale for this change

Currently, AggregateUDFImpl::is_nullable() returns true by default for all UDAFs, regardless of input characteristics. This is not ideal because:

  1. The same nullability information is already encoded in return_field()
  2. Most aggregate functions should only be nullable if their inputs are nullable (e.g., MIN, MAX, SUM)
  3. This pattern doesn't align with scalar UDFs, which already use return_field_from_args() for nullability

What changes are included in this PR?

Core Changes

  • Deprecated is_nullable() on AggregateUDFImpl trait with migration guidance
  • Updated udaf_default_return_field() to compute nullability from input fields:
    • Output is nullable if ANY input field is nullable
    • Output is non-nullable only if ALL inputs are non-nullable
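
The inference rule above can be sketched in a few lines. This is only an illustration, not the code from this PR: `InputField` and `infer_output_nullable` are hypothetical stand-ins for Arrow's field type and `udaf_default_return_field()`, reduced to the one flag that matters here.

```rust
/// Hypothetical, simplified stand-in for an input field: only the
/// nullability flag is modeled.
struct InputField {
    nullable: bool,
}

/// Output is nullable if ANY input is nullable, and non-nullable only if
/// ALL inputs are non-nullable. With zero inputs `any` yields `false`;
/// how the real function treats that case is not stated in this PR.
fn infer_output_nullable(inputs: &[InputField]) -> bool {
    inputs.iter().any(|f| f.nullable)
}

fn main() {
    let mixed = [InputField { nullable: false }, InputField { nullable: true }];
    let strict = [InputField { nullable: false }, InputField { nullable: false }];
    println!("mixed inputs  -> nullable = {}", infer_output_nullable(&mixed));
    println!("strict inputs -> nullable = {}", infer_output_nullable(&strict));
}
```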

Tests

Added 4 new tests validating nullability inference:

  • test_return_field_nullability_from_nullable_input
  • test_return_field_nullability_from_non_nullable_input
  • test_return_field_nullability_with_mixed_inputs
  • test_return_field_preserves_return_type

Documentation

  • New docs/source/library-user-guide/functions/udf-nullability.md with migration guide and examples
  • Updated adding-udfs.md with reference to nullability documentation

Are these changes tested?

Yes. All existing tests pass, plus 4 new tests specifically for nullability behavior.

Are there any user-facing changes?

Deprecation warning: Users implementing is_nullable() will see a deprecation warning directing them to use return_field() instead.

Behavioral change: Default nullability now depends on input field nullability rather than always returning true. Functions like COUNT that need to always return non-nullable should override return_field().
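
As a rough sketch of what such an override might look like, the toy trait below models only the nullability decision. `ToyAggregate`, `ToyMax`, and `ToyCount` are hypothetical names for illustration; the real `AggregateUDFImpl::return_field()` works with field types and has a different signature.

```rust
/// Hypothetical trait modeling only the nullability decision of an
/// aggregate function. This is NOT DataFusion's `AggregateUDFImpl`.
trait ToyAggregate {
    /// Default: infer from inputs (nullable if any input is nullable).
    fn output_nullable(&self, input_nullable: &[bool]) -> bool {
        input_nullable.iter().any(|&n| n)
    }
}

/// A MAX-like function keeps the input-dependent default.
struct ToyMax;
impl ToyAggregate for ToyMax {}

/// A COUNT-like function always produces a value, so it overrides the
/// default and reports non-nullable regardless of its inputs.
struct ToyCount;
impl ToyAggregate for ToyCount {
    fn output_nullable(&self, _input_nullable: &[bool]) -> bool {
        false
    }
}

fn main() {
    let nullable_input = [true];
    println!("max:   {}", ToyMax.output_nullable(&nullable_input));
    println!("count: {}", ToyCount.output_nullable(&nullable_input));
}
```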

This is a potentially breaking change for users who rely on the previous behavior of always-nullable outputs, but the new behavior is more correct and aligns with scalar UDF patterns.

This commit fixes issue apache#19612 where accumulators that don't implement
retract_batch exhibit buggy behavior in window frame queries.

When aggregate functions are used with window frames like
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, DataFusion uses
PlainAggregateWindowExpr which calls evaluate() multiple times on the
same accumulator instance. Accumulators that use std::mem::take() in
their evaluate() method consume their internal state, causing incorrect
results on subsequent calls.
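
The failure mode can be reproduced without DataFusion. In the sketch below, `TakingConcat` and `CloningConcat` are hypothetical toy accumulators (not the real `SimpleStringAggAccumulator`), and the helper functions imitate a cumulative window frame by calling `evaluate()` after every row on the same instance.

```rust
/// Toy accumulator whose evaluate() consumes its state via mem::take.
struct TakingConcat {
    buf: String,
}

impl TakingConcat {
    fn update(&mut self, s: &str) {
        self.buf.push_str(s);
    }
    fn evaluate(&mut self) -> String {
        // Leaves an empty String behind, so earlier rows are forgotten.
        std::mem::take(&mut self.buf)
    }
}

/// Toy accumulator whose evaluate() leaves its state intact.
struct CloningConcat {
    buf: String,
}

impl CloningConcat {
    fn update(&mut self, s: &str) {
        self.buf.push_str(s);
    }
    fn evaluate(&self) -> String {
        self.buf.clone()
    }
}

/// Imitates a cumulative frame: update, then evaluate, once per row.
fn running_taking(rows: &[&str]) -> Vec<String> {
    let mut acc = TakingConcat { buf: String::new() };
    let mut out = Vec::new();
    for &row in rows {
        acc.update(row);
        out.push(acc.evaluate());
    }
    out
}

fn running_cloning(rows: &[&str]) -> Vec<String> {
    let mut acc = CloningConcat { buf: String::new() };
    let mut out = Vec::new();
    for &row in rows {
        acc.update(row);
        out.push(acc.evaluate());
    }
    out
}

fn main() {
    println!("{:?}", running_taking(&["a", "b", "c"])); // ["a", "b", "c"]
    println!("{:?}", running_cloning(&["a", "b", "c"])); // ["a", "ab", "abc"]
}
```

The cloning variant pays for a copy on every call, which is the trade-off accepted for `string_agg` in this commit.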

1. **percentile_cont**: Modified evaluate() to use mutable reference
   instead of consuming the Vec. Added retract_batch() support for
   both PercentileContAccumulator and DistinctPercentileContAccumulator.

2. **string_agg**: Changed SimpleStringAggAccumulator::evaluate() to
   clone the accumulated string instead of taking it.

- datafusion/functions-aggregate/src/percentile_cont.rs:
  - Changed calculate_percentile() to take &mut [T::Native] instead of Vec<T::Native>
  - Updated PercentileContAccumulator::evaluate() to pass reference
  - Updated DistinctPercentileContAccumulator::evaluate() to clone values
  - Added retract_batch() implementation using HashMap for efficient removal
  - Updated PercentileContGroupsAccumulator::evaluate() for consistency
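
A minimal sketch of the two ideas listed above, under simplifying assumptions: `ToyPercentile` is a hypothetical accumulator over `i64` that computes only the median, `evaluate()` sorts through a mutable slice instead of consuming the `Vec`, and `retract_batch()` counts the values to remove in a `HashMap` so each stored value is visited once. The early return on an empty batch is an addition for illustration, not something the commit message describes.

```rust
use std::collections::HashMap;

/// Median of a slice, sorting it in place. Taking `&mut [i64]` rather than
/// `Vec<i64>` means the caller keeps ownership of the values.
fn median(values: &mut [i64]) -> Option<f64> {
    let n = values.len();
    if n == 0 {
        return None;
    }
    values.sort_unstable();
    if n % 2 == 1 {
        Some(values[n / 2] as f64)
    } else {
        Some((values[n / 2 - 1] + values[n / 2]) as f64 / 2.0)
    }
}

/// Hypothetical toy accumulator, not the real PercentileContAccumulator.
#[derive(Default)]
struct ToyPercentile {
    values: Vec<i64>,
}

impl ToyPercentile {
    fn update_batch(&mut self, batch: &[i64]) {
        self.values.extend_from_slice(batch);
    }

    /// Removes one stored occurrence for each value in `batch`.
    fn retract_batch(&mut self, batch: &[i64]) {
        if batch.is_empty() {
            return; // nothing to remove
        }
        // Count how many occurrences of each value must be removed.
        let mut to_remove: HashMap<i64, usize> = HashMap::new();
        for &v in batch {
            *to_remove.entry(v).or_default() += 1;
        }
        // Single pass over stored values, dropping matches until each
        // counter reaches zero.
        self.values.retain(|v| match to_remove.get_mut(v) {
            Some(count) if *count > 0 => {
                *count -= 1;
                false
            }
            _ => true,
        });
    }

    /// Reorders the stored values but does not discard them, so repeated
    /// calls give the same answer.
    fn evaluate(&mut self) -> Option<f64> {
        median(&mut self.values)
    }
}

fn main() {
    let mut acc = ToyPercentile::default();
    acc.update_batch(&[4, 1, 3, 2]);
    println!("{:?}", acc.evaluate());
    println!("{:?}", acc.evaluate()); // same result on the second call
    acc.retract_batch(&[1, 2]);
    println!("{:?}", acc.evaluate());
}
```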

- datafusion/functions-aggregate/src/string_agg.rs:
  - Changed evaluate() to use clone() instead of std::mem::take()

- datafusion/sqllogictest/test_files/aggregate.slt:
  - Added test cases for percentile_cont with window frames
  - Added test comparing median() vs percentile_cont(0.5) behavior
  - Added test for string_agg cumulative window frame

- docs/source/library-user-guide/functions/adding-udfs.md:
  - Added documentation about window-compatible accumulators
  - Explained evaluate() state preservation requirements
  - Documented retract_batch() implementation guidance

Closes apache#19612
…ability inference

This change improves how nullability is computed for aggregate UDF outputs by making
it depend on the nullability of input fields, aligning with the pattern used for
scalar UDFs.

Changes:
- Mark is_nullable() method as deprecated in AggregateUDFImpl trait
- Update udaf_default_return_field() to compute output nullability from input fields:
  * Output is nullable if ANY input field is nullable
  * Output is non-nullable only if ALL input fields are non-nullable
- Add deprecation migration guide in is_nullable() documentation
- Add #[allow(deprecated)] to wrapper method calls in AggregateUDF and AliasedAggregateUDFImpl
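
For readers unfamiliar with the deprecate-then-wrap pattern described in these bullets, here is a small sketch. `ToyUdaf`, `Legacy`, and `Wrapper` are hypothetical names, and the deprecation note text is invented for illustration; the sketch only shows how `#[deprecated]` on a trait method interacts with a narrowly scoped `#[allow(deprecated)]` on a forwarding wrapper.

```rust
/// Hypothetical trait, not DataFusion's `AggregateUDFImpl`.
trait ToyUdaf {
    /// Old API: always nullable by default. Callers get a compiler warning.
    #[deprecated(note = "override `output_nullable` instead")]
    fn is_nullable(&self) -> bool {
        true
    }

    /// New API: nullability derived from the inputs.
    fn output_nullable(&self, input_nullable: &[bool]) -> bool {
        input_nullable.iter().any(|&n| n)
    }
}

/// An implementation that relies on both defaults.
struct Legacy;
impl ToyUdaf for Legacy {}

/// A wrapper that must keep forwarding the old method for compatibility.
struct Wrapper<T: ToyUdaf> {
    inner: T,
}

impl<T: ToyUdaf> Wrapper<T> {
    /// The allow is scoped to this one forwarding method, so deprecated
    /// calls elsewhere in the crate still produce warnings.
    #[allow(deprecated)]
    fn is_nullable(&self) -> bool {
        self.inner.is_nullable()
    }
}

fn main() {
    let w = Wrapper { inner: Legacy };
    println!("legacy is_nullable: {}", w.is_nullable());
    println!("inferred: {}", w.inner.output_nullable(&[false, false]));
}
```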

Testing:
- Add 4 new tests validating nullability inference from input fields:
  * test_return_field_nullability_from_nullable_input
  * test_return_field_nullability_from_non_nullable_input
  * test_return_field_nullability_with_mixed_inputs
  * test_return_field_preserves_return_type
- All existing tests continue to pass (test_partial_eq, test_partial_ord)
- No regressions in aggregate function execution

Documentation:
- Create new docs/source/library-user-guide/functions/udf-nullability.md
  * Explains the nullability change and rationale
  * Provides migration guide for custom UDAF implementations
  * Includes examples for default and custom nullability behavior
  * References scalar UDF patterns
- Update docs/source/library-user-guide/functions/adding-udfs.md
  * Add section on nullability of aggregate functions
  * Link to new comprehensive nullability documentation

Fixes: apache#19511 (related to apache#18882)
@github-actions github-actions bot added the documentation, logical-expr, sqllogictest, and functions labels on Jan 7, 2026

## See Also

- [Adding User Defined Functions](adding-udfs.md) - General guide to implementing UDFs
- [Scalar UDF Nullability](#) - Similar concepts for scalar UDFs (which already use `return_field_from_args()`)
Member

A link target is missing here


let arr = values[0].as_primitive::<T>();
for value in arr.iter().flatten() {
    self.distinct_values.values.remove(&Hashable(value));
Member

Is there a .slt test for this?

*to_remove.entry(v).or_default() += 1;
}
}

Member

Return early here if to_remove.is_empty()?

@martin-g
Member

martin-g commented Jan 8, 2026

Functions like COUNT that need to always return non-nullable should override return_field().

Is this planned for a later PR?

Contributor

@Jefffrey Jefffrey left a comment

Honestly I do not have confidence in this PR especially considering the issue it tackles.

  • It is bleeding in changes from other PRs (#19618)
  • It claims to close the issue (#19511) but this only addresses aggregate UDFs
  • It includes details in the PR body like saying COUNT should be fixed but doesn't attempt that in this PR
  • There are small issues, such as the deprecation being marked as starting from version 42 and the liberal use of #[allow(...)], which would have been caught if clippy had been run

All these lead me to think that proper consideration hasn't been given to this PR so I am not very inclined towards it. I feel a lot of this code is generated by an LLM and hasn't been disclosed, or even tested.

@GaneshPatil7517
Author

GaneshPatil7517 commented Jan 9, 2026

Honestly I do not have confidence in this PR especially considering the issue it tackles.

All these lead me to think that proper consideration hasn't been given to this PR so I am not very inclined towards it. I feel a lot of this code is generated by an LLM and hasn't been disclosed, or even tested.

No, I am a beginner in open source. What actually happened is that I mistakenly pushed code for another issue because I did not create a separate branch. I apologize for this.

GaneshPatil7517 and others added 3 commits January 9, 2026 22:25
@GaneshPatil7517
Author

I'll close this PR and create a new, clean one.

Development

Successfully merging this pull request may close these issues.

Consider changing nullability of UDFs to depend on inputs by default
