Commit 4fb36b2
authored
fix: bitmap_count should report nullability correctly (#19195)
## Which issue does this PR close?
Closes #19146
Part of #19144 (EPIC: fix nullability report for spark expression)
## Rationale for this change
The `bitmap_count` UDF was using the default `return_type`
implementation which does not preserve nullability information. This
causes:
1. **Incorrect schema inference** - non-nullable binary inputs are
incorrectly marked as producing nullable Int64 outputs
2. **Missed optimization opportunities** - the query optimizer cannot
apply nullability-based optimizations when metadata is incorrect
3. **Inconsistent behavior** - other similar functions preserve
nullability, leading to unexpected differences
The `bitmap_count` function counts set bits in a binary input and
returns an Int64. The operation itself doesn't introduce nullability -
if the input is non-nullable, the output will always be non-nullable.
Therefore, the output nullability should match the input.
## What changes are included in this PR?
1. **Implemented `return_field_from_args`**: Creates a field with Int64
type and the same nullability as the input field
2. **Updated `return_type`**: Now returns an error directing users to
use `return_field_from_args` instead (following DataFusion best
practices)
3. **Added `FieldRef` import** to support returning field metadata
4. **Added nullability test**: Verifies that both nullable and
non-nullable inputs are handled correctly
## Are these changes tested?
Yes, this PR includes a new test `test_bitmap_count_nullability` that
verifies:
- Non-nullable Binary input produces non-nullable Int64 output
- Nullable Binary input produces nullable Int64 output
- Data type is correctly set to Int64 in both cases
Test results:
```
running 1 test
test function::bitmap::bitmap_count::tests::test_bitmap_count_nullability ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
```
Additionally, all existing `bitmap_count` tests continue to pass,
ensuring backward compatibility.
## Are there any user-facing changes?
**Yes - Schema metadata improvement:**
Users will now see correct nullability information in the schema:
**Before (Bug):**
- Non-nullable binary input → nullable Int64 output (incorrect)
**After (Fixed):**
- Non-nullable binary input → non-nullable Int64 output (correct)
- Nullable binary input → nullable Int64 output (correct)
This is a **bug fix** that corrects schema metadata only - it does not
change the actual computation or introduce any breaking changes to the
API.
**Impact:**
- Query optimizers can now make better decisions based on accurate
nullability information
- Schema validation will be more accurate
- No changes to function behavior or output values
---
## Code Changes Summary
### Modified File:
`datafusion/spark/src/function/bitmap/bitmap_count.rs`
#### 1. Added Import
```rust
use arrow::datatypes::{DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type};
```
#### 2. Updated return_type Method
```rust
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
internal_err!("return_field_from_args should be used instead")
}
```
#### 3. Added return_field_from_args Implementation
```rust
fn return_field_from_args(
&self,
args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
use arrow::datatypes::Field;
// bitmap_count returns Int64 with the same nullability as the input
Ok(Arc::new(Field::new(
args.arg_fields[0].name(),
DataType::Int64,
args.arg_fields[0].is_nullable(),
)))
}
```
#### 4. Added Test
```rust
#[test]
fn test_bitmap_count_nullability() -> Result<()> {
use datafusion_expr::ReturnFieldArgs;
let bitmap_count = BitmapCount::new();
// Test with non-nullable binary field
let non_nullable_field = Arc::new(Field::new("bin", DataType::Binary, false));
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&non_nullable_field)],
scalar_arguments: &[None],
})?;
// The result should not be nullable (same as input)
assert!(!result.is_nullable());
assert_eq!(result.data_type(), &Int64);
// Test with nullable binary field
let nullable_field = Arc::new(Field::new("bin", DataType::Binary, true));
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&nullable_field)],
scalar_arguments: &[None],
})?;
// The result should be nullable (same as input)
assert!(result.is_nullable());
assert_eq!(result.data_type(), &Int64);
Ok(())
}
```
---
## Verification Steps
1. **Run the new test:**
```bash
cargo test -p datafusion-spark test_bitmap_count_nullability --lib
```
2. **Run all bitmap_count tests:**
```bash
cargo test -p datafusion-spark bitmap_count --lib
```
3. **Run clippy checks:**
```bash
cargo clippy -p datafusion-spark --all-targets -- -D warnings
```
All checks pass successfully!
---
## Related Issues
- Closes: #19146
- Part of EPIC: #19144 (fix nullability report for spark expression)
- Similar fix: #19145 (shuffle function nullability)
---1 parent 2626fc2 commit 4fb36b2
1 file changed
+48
-2
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
28 | | - | |
| 28 | + | |
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
| |||
71 | 71 | | |
72 | 72 | | |
73 | 73 | | |
74 | | - | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
75 | 88 | | |
76 | 89 | | |
77 | 90 | | |
| |||
224 | 237 | | |
225 | 238 | | |
226 | 239 | | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
227 | 273 | | |
0 commit comments