Commit f734ec5

KARTIK64-rgb and alamb authored
Fix Subtraction overflow in max_distinct_count when hash join has a pushed-down limit (apache#20799)
## Which issue does this PR close?

- Closes apache#20779.

## Rationale for this change

In `max_distinct_count` (inside `datafusion/physical-plan/src/joins/utils.rs`), the `Precision::Exact` branch computes the number of non-null rows by doing:

```rust
let count = count - stats.null_count.get_value().unwrap_or(&0);
```

Before apache#20228 this subtraction was always safe because `num_rows` was never smaller than `null_count`. But apache#20228 added `fetch` (limit push-down) support to `HashJoinExec`, and when a limit is applied, `partition_statistics()` caps `num_rows` to `Exact(fetch_value)` without also capping the per-column `null_count`. This means `null_count` can legally exceed `num_rows`, causing a panic with *"attempt to subtract with overflow"*.

## What changes are included in this PR?

- **Bug fix** in `max_distinct_count` (`utils.rs` ~line 725): replaced the bare subtraction with a saturating subtraction so that when `null_count` exceeds `num_rows` the result is clamped to `0` instead of panicking.

  ```rust
  // Before
  let count = count - stats.null_count.get_value().unwrap_or(&0);
  // After
  let count = count.saturating_sub(*stats.null_count.get_value().unwrap_or(&0));
  ```

- **Regression test** added at the bottom of the `mod tests` block in the same file. The test deliberately constructs a scenario where `null_count (5) > num_rows (2)` and asserts that `max_distinct_count` returns `Exact(0)` without panicking.

## Are these changes tested?

Yes. A new unit test `test_max_distinct_count_no_overflow_when_null_count_exceeds_num_rows` is added directly in `datafusion/physical-plan/src/joins/utils.rs`. It covers the exact edge-case from the bug report (null_count > num_rows after a fetch/limit push-down) and would have caught the panic before the fix.

## Are there any user-facing changes?

No user-facing or API changes. This is a purely internal arithmetic fix in the statistics estimation logic. Queries that previously panicked when a limit was pushed down into a `HashJoinExec` will now complete successfully.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
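
The description shows `saturating_sub`, while the diff in this commit uses `checked_sub(...).unwrap_or(0)`. For unsigned integers the two forms should give the same result. The sketch below illustrates that with plain `usize` values; the function names are hypothetical and are not part of DataFusion.

```rust
/// Non-null row count using the form shown in the PR description.
/// Clamps to 0 when `null_count` exceeds `num_rows`.
pub fn non_null_saturating(num_rows: usize, null_count: usize) -> usize {
    num_rows.saturating_sub(null_count)
}

/// Non-null row count using the form that appears in the diff.
/// `checked_sub` returns `None` on underflow, which is mapped to 0.
pub fn non_null_checked(num_rows: usize, null_count: usize) -> usize {
    num_rows.checked_sub(null_count).unwrap_or(0)
}
```

With the values from the regression test (`num_rows = 2`, `null_count = 5`), both helpers return `0`, whereas a bare `2usize - 5` panics in builds with overflow checks enabled.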
1 parent 8f721a6 commit f734ec5

2 files changed: +42 -3 lines changed


datafusion/physical-plan/src/joins/utils.rs

Lines changed: 19 additions & 3 deletions
@@ -722,11 +722,12 @@ fn max_distinct_count(
                 }
             }
             Precision::Exact(count) => {
-                let count = count - stats.null_count.get_value().unwrap_or(&0);
+                let null_count = *stats.null_count.get_value().unwrap_or(&0);
+                let non_null_count = count.checked_sub(null_count).unwrap_or(0);
                 if stats.null_count.is_exact().unwrap_or(false) {
-                    Precision::Exact(count)
+                    Precision::Exact(non_null_count)
                 } else {
-                    Precision::Inexact(count)
+                    Precision::Inexact(non_null_count)
                 }
             }
         };
@@ -2939,4 +2940,19 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn test_max_distinct_count_no_overflow_when_null_count_exceeds_num_rows() {
+        let num_rows = Exact(2);
+        let stats = ColumnStatistics {
+            distinct_count: Absent,
+            null_count: Exact(5),
+            min_value: Absent,
+            max_value: Absent,
+            sum_value: Absent,
+            byte_size: Absent,
+        };
+        let result = max_distinct_count(&num_rows, &stats);
+        assert_eq!(result, Exact(0));
+    }
 }
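
To make the exactness handling in the patched `Precision::Exact` arm easier to follow, here is a minimal standalone sketch. `ToyPrecision` and `non_null_from_exact` are hypothetical stand-ins written for illustration; they only imitate the shape of the branch shown in the diff and are not DataFusion's actual types or signatures.

```rust
/// Simplified stand-in for a statistics value that may be exact,
/// estimated, or unknown. Hypothetical; not DataFusion's `Precision`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyPrecision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl ToyPrecision {
    /// The carried value, if any.
    pub fn value(self) -> Option<usize> {
        match self {
            ToyPrecision::Exact(v) | ToyPrecision::Inexact(v) => Some(v),
            ToyPrecision::Absent => None,
        }
    }

    /// True only for `Exact`.
    pub fn is_exact(self) -> bool {
        matches!(self, ToyPrecision::Exact(_))
    }
}

/// Mirrors the patched branch: given an exact row count, subtract the
/// null count without underflowing, and keep the result exact only
/// when the null count itself is exact.
pub fn non_null_from_exact(count: usize, null_count: ToyPrecision) -> ToyPrecision {
    let nulls = null_count.value().unwrap_or(0);
    let non_null = count.checked_sub(nulls).unwrap_or(0);
    if null_count.is_exact() {
        ToyPrecision::Exact(non_null)
    } else {
        ToyPrecision::Inexact(non_null)
    }
}
```

Under these assumptions, `non_null_from_exact(2, ToyPrecision::Exact(5))` yields `ToyPrecision::Exact(0)`, matching what the regression test asserts, while an absent null count leaves the row count unchanged but marks it inexact.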

datafusion/sqllogictest/test_files/joins.slt

Lines changed: 23 additions & 0 deletions
@@ -5404,3 +5404,26 @@ DROP TABLE t1;
 
 statement count 0
 DROP TABLE t2;
+
+statement ok
+CREATE TABLE t1(a INT, b INT) AS VALUES
+(NULL, 1), (NULL, 2), (NULL, 3), (NULL, 4), (NULL, 5);
+
+statement ok
+CREATE TABLE t2(c INT) AS VALUES (1), (2);
+
+# This query panicked before the fix: the ORDER BY forces a SortExec,
+# the LIMIT gets pushed into SortExec.fetch, and the HashJoinExec
+# calls partition_statistics() on the SortExec child during execution.
+query II
+SELECT sub.a, sub.b FROM (
+SELECT * FROM t1 ORDER BY b LIMIT 1
+) sub
+JOIN t2 ON sub.a = t2.c;
+----
+
+statement ok
+DROP TABLE t1;
+
+statement ok
+DROP TABLE t2;
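
The statistics mismatch this test exercises can be modelled in a few lines. The sketch below is a toy model based only on the PR description (a limit caps `num_rows` but leaves per-column `null_count` untouched); `ToyStats` and both functions are hypothetical names, not DataFusion APIs.

```rust
/// Toy table-level statistics: a row count plus one null count per column.
/// Hypothetical; not DataFusion's `Statistics`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyStats {
    pub num_rows: usize,
    pub null_counts: Vec<usize>,
}

/// Applies a fetch/limit the way the PR description says it behaved:
/// `num_rows` is capped, per-column null counts are copied unchanged.
pub fn apply_fetch_rows_only(stats: &ToyStats, fetch: usize) -> ToyStats {
    ToyStats {
        num_rows: stats.num_rows.min(fetch),
        null_counts: stats.null_counts.clone(),
    }
}

/// True when any column reports more nulls than there are rows.
pub fn has_inconsistent_null_count(stats: &ToyStats) -> bool {
    stats.null_counts.iter().any(|&n| n > stats.num_rows)
}
```

Modelling `t1` as `ToyStats { num_rows: 5, null_counts: vec![5, 0] }` and applying a fetch of 1 gives one row with a null count of 5 for column `a`, which is the kind of input that made the bare subtraction underflow.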
