Skip to content

Commit 4fb36b2

Browse files
fix: bitmap_count should report nullability correctly (#19195)
## Which issue does this PR close? Closes #19146 Part of #19144 (EPIC: fix nullability report for spark expression) ## Rationale for this change The `bitmap_count` UDF was using the default `return_type` implementation which does not preserve nullability information. This causes: 1. **Incorrect schema inference** - non-nullable binary inputs are incorrectly marked as producing nullable Int64 outputs 2. **Missed optimization opportunities** - the query optimizer cannot apply nullability-based optimizations when metadata is incorrect 3. **Inconsistent behavior** - other similar functions preserve nullability, leading to unexpected differences The `bitmap_count` function counts set bits in a binary input and returns an Int64. The operation itself doesn't introduce nullability - if the input is non-nullable, the output will always be non-nullable. Therefore, the output nullability should match the input. ## What changes are included in this PR? 1. **Implemented `return_field_from_args`**: Creates a field with Int64 type and the same nullability as the input field 2. **Updated `return_type`**: Now returns an error directing users to use `return_field_from_args` instead (following DataFusion best practices) 3. **Added `FieldRef` import** to support returning field metadata 4. **Added nullability test**: Verifies that both nullable and non-nullable inputs are handled correctly ## Are these changes tested? Yes, this PR includes a new test `test_bitmap_count_nullability` that verifies: - Non-nullable Binary input produces non-nullable Int64 output - Nullable Binary input produces nullable Int64 output - Data type is correctly set to Int64 in both cases Test results: ``` running 1 test test function::bitmap::bitmap_count::tests::test_bitmap_count_nullability ... ok test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured ``` Additionally, all existing `bitmap_count` tests continue to pass, ensuring backward compatibility. ## Are there any user-facing changes? **Yes - Schema metadata improvement:** Users will now see correct nullability information in the schema: **Before (Bug):** - Non-nullable binary input → nullable Int64 output (incorrect) **After (Fixed):** - Non-nullable binary input → non-nullable Int64 output (correct) - Nullable binary input → nullable Int64 output (correct) This is a **bug fix** that corrects schema metadata only - it does not change the actual computation or introduce any breaking changes to the API. **Impact:** - Query optimizers can now make better decisions based on accurate nullability information - Schema validation will be more accurate - No changes to function behavior or output values --- ## Code Changes Summary ### Modified File: `datafusion/spark/src/function/bitmap/bitmap_count.rs` #### 1. Added Import ```rust use arrow::datatypes::{DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type}; ``` #### 2. Updated return_type Method ```rust fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { internal_err!("return_field_from_args should be used instead") } ``` #### 3. Added return_field_from_args Implementation ```rust fn return_field_from_args( &self, args: datafusion_expr::ReturnFieldArgs, ) -> Result<FieldRef> { use arrow::datatypes::Field; // bitmap_count returns Int64 with the same nullability as the input Ok(Arc::new(Field::new( args.arg_fields[0].name(), DataType::Int64, args.arg_fields[0].is_nullable(), ))) } ``` #### 4. Added Test ```rust #[test] fn test_bitmap_count_nullability() -> Result<()> { use datafusion_expr::ReturnFieldArgs; let bitmap_count = BitmapCount::new(); // Test with non-nullable binary field let non_nullable_field = Arc::new(Field::new("bin", DataType::Binary, false)); let result = bitmap_count.return_field_from_args(ReturnFieldArgs { arg_fields: &[Arc::clone(&non_nullable_field)], scalar_arguments: &[None], })?; // The result should not be nullable (same as input) assert!(!result.is_nullable()); assert_eq!(result.data_type(), &Int64); // Test with nullable binary field let nullable_field = Arc::new(Field::new("bin", DataType::Binary, true)); let result = bitmap_count.return_field_from_args(ReturnFieldArgs { arg_fields: &[Arc::clone(&nullable_field)], scalar_arguments: &[None], })?; // The result should be nullable (same as input) assert!(result.is_nullable()); assert_eq!(result.data_type(), &Int64); Ok(()) } ``` --- ## Verification Steps 1. **Run the new test:** ```bash cargo test -p datafusion-spark test_bitmap_count_nullability --lib ``` 2. **Run all bitmap_count tests:** ```bash cargo test -p datafusion-spark bitmap_count --lib ``` 3. **Run clippy checks:** ```bash cargo clippy -p datafusion-spark --all-targets -- -D warnings ``` All checks pass successfully! --- ## Related Issues - Closes: #19146 - Part of EPIC: #19144 (fix nullability report for spark expression) - Similar fix: #19145 (shuffle function nullability) ---
1 parent 2626fc2 commit 4fb36b2

File tree

1 file changed

+48
-2
lines changed

1 file changed

+48
-2
lines changed

datafusion/spark/src/function/bitmap/bitmap_count.rs

Lines changed: 48 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ use arrow::array::{
2525
use arrow::datatypes::DataType::{
2626
Binary, BinaryView, Dictionary, FixedSizeBinary, LargeBinary,
2727
};
28-
use arrow::datatypes::{DataType, Int16Type, Int32Type, Int64Type, Int8Type};
28+
use arrow::datatypes::{DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type};
2929
use datafusion_common::utils::take_function_args;
3030
use datafusion_common::{internal_err, Result};
3131
use datafusion_expr::{
@@ -71,7 +71,20 @@ impl ScalarUDFImpl for BitmapCount {
7171
}
7272

7373
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
74-
Ok(DataType::Int64)
74+
internal_err!("return_field_from_args should be used instead")
75+
}
76+
77+
fn return_field_from_args(
78+
&self,
79+
args: datafusion_expr::ReturnFieldArgs,
80+
) -> Result<FieldRef> {
81+
use arrow::datatypes::Field;
82+
// bitmap_count returns Int64 with the same nullability as the input
83+
Ok(Arc::new(Field::new(
84+
args.arg_fields[0].name(),
85+
DataType::Int64,
86+
args.arg_fields[0].is_nullable(),
87+
)))
7588
}
7689

7790
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
@@ -224,4 +237,37 @@ mod tests {
224237
assert_eq!(*actual.into_array(1)?, *expect.into_array(1)?);
225238
Ok(())
226239
}
240+
241+
#[test]
242+
fn test_bitmap_count_nullability() -> Result<()> {
243+
use datafusion_expr::ReturnFieldArgs;
244+
245+
let bitmap_count = BitmapCount::new();
246+
247+
// Test with non-nullable binary field
248+
let non_nullable_field = Arc::new(Field::new("bin", DataType::Binary, false));
249+
250+
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
251+
arg_fields: &[Arc::clone(&non_nullable_field)],
252+
scalar_arguments: &[None],
253+
})?;
254+
255+
// The result should not be nullable (same as input)
256+
assert!(!result.is_nullable());
257+
assert_eq!(result.data_type(), &Int64);
258+
259+
// Test with nullable binary field
260+
let nullable_field = Arc::new(Field::new("bin", DataType::Binary, true));
261+
262+
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
263+
arg_fields: &[Arc::clone(&nullable_field)],
264+
scalar_arguments: &[None],
265+
})?;
266+
267+
// The result should be nullable (same as input)
268+
assert!(result.is_nullable());
269+
assert_eq!(result.data_type(), &Int64);
270+
271+
Ok(())
272+
}
227273
}

0 commit comments

Comments
 (0)