-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Deprecate AggregateUDFImpl::is_nullable in favor of return_field nullability inference #19688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate AggregateUDFImpl::is_nullable in favor of return_field nullability inference #19688
Conversation
This commit fixes issue apache#19612 where accumulators that don't implement retract_batch exhibit buggy behavior in window frame queries. When aggregate functions are used with window frames like `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, DataFusion uses PlainAggregateWindowExpr which calls evaluate() multiple times on the same accumulator instance. Accumulators that use std::mem::take() in their evaluate() method consume their internal state, causing incorrect results on subsequent calls. 1. **percentile_cont**: Modified evaluate() to use mutable reference instead of consuming the Vec. Added retract_batch() support for both PercentileContAccumulator and DistinctPercentileContAccumulator. 2. **string_agg**: Changed SimpleStringAggAccumulator::evaluate() to clone the accumulated string instead of taking it. - datafusion/functions-aggregate/src/percentile_cont.rs: - Changed calculate_percentile() to take &mut [T::Native] instead of Vec<T::Native> - Updated PercentileContAccumulator::evaluate() to pass reference - Updated DistinctPercentileContAccumulator::evaluate() to clone values - Added retract_batch() implementation using HashMap for efficient removal - Updated PercentileContGroupsAccumulator::evaluate() for consistency - datafusion/functions-aggregate/src/string_agg.rs: - Changed evaluate() to use clone() instead of std::mem::take() - datafusion/sqllogictest/test_files/aggregate.slt: - Added test cases for percentile_cont with window frames - Added test comparing median() vs percentile_cont(0.5) behavior - Added test for string_agg cumulative window frame - docs/source/library-user-guide/functions/adding-udfs.md: - Added documentation about window-compatible accumulators - Explained evaluate() state preservation requirements - Documented retract_batch() implementation guidance Closes apache#19612
…ability inference This change improves how nullability is computed for aggregate UDF outputs by making it depend on the nullability of input fields, aligning with the pattern used for scalar UDFs. Changes: - Mark is_nullable() method as deprecated in AggregateUDFImpl trait - Update udaf_default_return_field() to compute output nullability from input fields: * Output is nullable if ANY input field is nullable * Output is non-nullable only if ALL input fields are non-nullable - Add deprecation migration guide in is_nullable() documentation - Add #[allow(deprecated)] to wrapper method calls in AggregateUDF and AliasedAggregateUDFImpl Testing: - Add 4 new tests validating nullability inference from input fields: * test_return_field_nullability_from_nullable_input * test_return_field_nullability_from_non_nullable_input * test_return_field_nullability_with_mixed_inputs * test_return_field_preserves_return_type - All existing tests continue to pass (test_partial_eq, test_partial_ord) - No regressions in aggregate function execution Documentation: - Create new docs/source/library-user-guide/functions/udf-nullability.md * Explains the nullability change and rationale * Provides migration guide for custom UDAF implementations * Includes examples for default and custom nullability behavior * References scalar UDF patterns - Update docs/source/library-user-guide/functions/adding-udfs.md * Add section on nullability of aggregate functions * Link to new comprehensive nullability documentation Fixes: apache#19511 (related to apache#18882)
| ## See Also | ||
|
|
||
| - [Adding User Defined Functions](adding-udfs.md) - General guide to implementing UDFs | ||
| - [Scalar UDF Nullability](#) - Similar concepts for scalar UDFs (which already use `return_field_from_args()`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A link target is missing here
|
|
||
| let arr = values[0].as_primitive::<T>(); | ||
| for value in arr.iter().flatten() { | ||
| self.distinct_values.values.remove(&Hashable(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a .slt test for this ?
| *to_remove.entry(v).or_default() += 1; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return early here is to_remove.is_empty() ?
Is this planned for a later PR ? |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I do not have confidence in this PR especially considering the issue it tackles.
- It is bleeding in changes from other PRs (#19618)
- It claims to close the issue (#19511) but this only addresses aggregate UDFs
- It includes details in the PR body like saying
COUNTshould be fixed but doesn't attempt that in this PR - There's small issues like how it deprecates from version 42 or the liberal use of
#[allow(...)]which would be caught if clippy was run
All these lead me to think that proper consideration hasn't been given to this PR so I am not very inclined towards it. I feel a lot of this code is generated by an LLM and hasn't been disclosed, or even tested.
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
no i was beginner in opensource, and actually what happened i mistakable pushed code of the another issue i did not created branch.. i apologises for this.... |
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
|
illl close this PR and create another with Clean..... |
Which issue does this PR close?
Closes #19511
Related to #18882
Rationale for this change
Currently,
AggregateUDFImpl::is_nullable()returnstrueby default for all UDAFs, regardless of input characteristics. This is not ideal because:return_field()MIN,MAX,SUM)return_field_from_args()for nullabilityWhat changes are included in this PR?
Core Changes
is_nullable()onAggregateUDFImpltrait with migration guidanceudaf_default_return_field()to compute nullability from input fields:Tests
Added 4 new tests validating nullability inference:
test_return_field_nullability_from_nullable_inputtest_return_field_nullability_from_non_nullable_inputtest_return_field_nullability_with_mixed_inputstest_return_field_preserves_return_typeDocumentation
docs/source/library-user-guide/functions/udf-nullability.mdwith migration guide and examplesadding-udfs.mdwith reference to nullability documentationAre these changes tested?
Yes. All existing tests pass, plus 4 new tests specifically for nullability behavior.
Are there any user-facing changes?
Deprecation warning: Users implementing
is_nullable()will see a deprecation warning directing them to usereturn_field()instead.Behavioral change: Default nullability now depends on input field nullability rather than always returning
true. Functions likeCOUNTthat need to always return non-nullable should overridereturn_field().This is a potentially breaking change for users who rely on the previous behavior of always-nullable outputs, but the new behavior is more correct and aligns with scalar UDF patterns.