feat: Add null-aware anti join support #19635

viirya · 2026-01-04T08:56:06Z

Which issue does this PR close?

Closes DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583.

Rationale for this change

What changes are included in this PR?

This patch implements null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values.

Are these changes tested?

Are there any user-facing changes?

viirya · 2026-01-04T18:28:35Z

datafusion/sqllogictest/test_files/joins.slt

 query IT
 SELECT t1_id, t1_name FROM join_test_left WHERE t1_id NOT IN (SELECT t2_id FROM join_test_right) ORDER BY t1_id;
 ----
-NULL e


The existing test was expecting (NULL, 'e') to be returned by a NOT IN query when the subquery contains NULL values. This is incorrect according to SQL semantics.

comphead · 2026-01-04T21:41:19Z

Thanks @viirya for taking care on this, I'll check this out early next week!

datafusion/expr/src/logical_plan/plan.rs

datafusion/proto/src/physical_plan/mod.rs

datafusion/sqllogictest/test_files/null_aware_anti_join.slt

datafusion/physical-plan/src/joins/hash_join/stream.rs

comphead · 2026-01-06T16:38:43Z

datafusion/sqllogictest/test_files/null_aware_anti_join.slt

+2 b
+3 c
+4 d
+NULL e


NULL, e

shouldn't be here

D SELECT * FROM outer_table WHERE id NOT IN (SELECT id FROM inner_table_no_null) OR id NOT IN (SELECT id FROM inner_table2); ┌───────┬─────────┐ │ id │ value │ │ int32 │ varchar │ ├───────┼─────────┤ │ 1 │ a │ │ 3 │ c │ │ 2 │ b │ │ 4 │ d │ └───────┴─────────┘

The test expectation was indeed incorrect according to SQL semantics.

The Problem

Test 9 has the query:
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM inner_table_no_null)
OR id NOT IN (SELECT id FROM inner_table2);

For the NULL row:

NULL NOT IN (2, 4) = UNKNOWN

NULL NOT IN (1, 3) = UNKNOWN

UNKNOWN OR UNKNOWN = UNKNOWN → should be filtered out

But the test was expecting (NULL, 'e') to be included, which is wrong.

Root Cause

When NOT IN subqueries appear in OR conditions, DataFusion uses RightMark joins instead of LeftAnti joins:

Mark joins add a boolean "mark" column indicating whether each row had a match

The filter then evaluates NOT mark OR NOT mark

The problem: Mark joins treat NULL keys as non-matching (FALSE) instead of UNKNOWN

This causes NOT FALSE OR NOT FALSE = TRUE, incorrectly including the NULL row

Why This Happens

Mark joins are designed to handle complex boolean expressions (like OR) by converting the subquery check into a boolean column. However, they don't implement null-aware semantics - the mark column is never NULL, even when it should be UNKNOWN due to NULL join keys.

The Solution (For Now)

The proper fix would be to implement null-aware support for mark joins, making the mark column nullable and setting it to NULL when join keys are NULL. However, this is a complex change that affects the core join implementation.

For now, I've:

Kept the test as-is (returning NULL row)

Added detailed comments documenting this as a KNOWN LIMITATION

Marked it as a TODO for future implementation

This way, the limitation is clearly documented and users/developers are aware of the issue, while we can address it properly in a future enhancement.

The above is the analysis from AI. I think that's said that the test expectation failure is on mark joins instead of the null-aware anti joins in this PR, i.e., it is an existing bug.

Why We Cannot Simply Use LeftAnti Joins

Short Answer: Because LeftAnti joins filter rows immediately, while OR conditions need to evaluate boolean expressions from multiple subqueries simultaneously.

The Fundamental Difference:

LeftAnti Join (filtering):
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM subquery)
- The join filters out matching rows directly
- Result: rows that don't match

OR Condition (boolean evaluation):
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM subquery1)
OR id NOT IN (SELECT id FROM subquery2)
- Need boolean values from BOTH subqueries
- Then evaluate: NOT match1 OR NOT match2
- Can't do this with filtering joins alone

Why Mark Joins Are Used:

Mark joins add a boolean column instead of filtering

This allows complex boolean expressions like OR, AND, NOT to be evaluated in a subsequent Filter operator

Example: WHERE (NOT mark1 OR NOT mark2) AND other_condition

The Current Problem:

Mark joins don't support null-aware semantics

They set mark = FALSE when no match, but should set mark = NULL when join key is NULL

Why It's Complex to Fix:

The mark column is created deep in the join execution code (build_batch_from_indices)

That function doesn't currently have access to:

The null_aware flag

The join key columns (to check if they're NULL)

Would require threading these through multiple layers of the codebase

We can't use LeftAnti because it filters instead of producing boolean values, and implementing null-aware mark joins requires significant refactoring of the join execution internals.

I will leave it to future work.

datafusion/sqllogictest/test_files/null_aware_anti_join.slt

This commit implements Phase 1 of null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values. - Add `null_aware: bool` field to HashJoinExec struct - Add validation: null_aware only for LeftAnti, single-column joins - Update all HashJoinExec::try_new() call sites (17 locations) - Add `probe_side_has_null` flag to track NULLs in probe side - Implement NULL detection during probe phase - Filter NULL-key rows during final emission stage - Add early exit when probe side contains NULL - Add 5 test functions with 17 test variants - Test scenarios: probe NULL, build NULL, no NULLs, validation - Add helper function `build_table_two_cols()` for nullable test data For `SELECT * FROM t1 WHERE c1 NOT IN (SELECT c2 FROM t2)`: 1. If c2 contains NULL → return 0 rows (three-valued logic) 2. If c1 is NULL → that row not in output 3. No NULLs → standard anti join behavior - Single-column join keys only - Must manually set null_aware=true (no planner integration yet) - LeftAnti join type only - All 17 null-aware tests passing - All 610 hash join tests passing Addresses issue apache#10583 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

This commit implements Phase 2 of null-aware anti join support, enabling automatic detection and configuration of null-aware semantics for SQL NOT IN subqueries. DataFusion now automatically provides correct SQL NOT IN semantics with three-valued logic. When users write NOT IN subqueries, the optimizer automatically detects them and enables null-aware execution. - Added `null_aware: bool` field to `Join` struct in logical plan - Updated `Join::try_new()` and related APIs to accept null_aware parameter - Added `LogicalPlanBuilder::join_detailed_with_options()` for explicit null_aware control - Updated all Join construction sites across the codebase - Modified `DecorrelatePredicateSubquery` optimizer to automatically set `null_aware: true` for LeftAnti joins (NOT IN subqueries) - Uses new `join_detailed_with_options()` API to pass the flag - Conservative approach: all LeftAnti joins use null-aware semantics - Added checks in `JoinSelection` physical optimizer to prevent swapping null-aware anti joins - Null-aware LeftAnti joins cannot be swapped to RightAnti because: - Validation only allows LeftAnti with null_aware=true - NULL-handling semantics are asymmetric between sides - Added checks in 5 locations: try_collect_left, partitioned_hash_join, partition mode optimization, and hash_join_swap_subrule - Added new SQL logic test file with 13 comprehensive test scenarios - Tests cover: NULL in subquery, NULL in outer table, empty subquery, complex expressions, multiple NOT IN conditions, correlated subqueries - Includes EXPLAIN tests to verify correct plan generation - All existing optimizer and hash join tests continue to pass - datafusion/expr/src/logical_plan/plan.rs - datafusion/expr/src/logical_plan/builder.rs - datafusion/expr/src/logical_plan/tree_node.rs - datafusion/optimizer/src/decorrelate_predicate_subquery.rs - datafusion/optimizer/src/eliminate_cross_join.rs - datafusion/optimizer/src/eliminate_outer_join.rs - datafusion/optimizer/src/extract_equijoin_predicate.rs - datafusion/physical-optimizer/src/join_selection.rs - datafusion/physical-optimizer/src/enforce_distribution.rs - datafusion/core/src/physical_planner.rs - datafusion/proto/src/physical_plan/mod.rs - datafusion/sqllogictest/test_files/null_aware_anti_join.slt (new) Before (Phase 1 - manual): ```rust HashJoinExec::try_new(..., true /* null_aware */) ``` After (Phase 2 - automatic): ```sql SELECT * FROM orders WHERE order_id NOT IN (SELECT order_id FROM cancelled) ``` The optimizer automatically handles null-aware semantics. - SQL logic tests: All passed - Optimizer tests: 568 passed - Hash join tests: 610 passed - Physical optimizer tests: 16 passed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

The previous implementation incorrectly applied null-aware semantics to ALL LeftAnti joins, including NOT EXISTS subqueries. This was wrong because: - **NOT IN**: Uses three-valued logic (TRUE/FALSE/UNKNOWN), requires null-aware - **NOT EXISTS**: Uses two-valued logic (TRUE/FALSE), should NOT be null-aware ```sql -- Setup: customers has (1, 2, 3, NULL), banned has (2, NULL) -- NOT IN - Correctly returns empty (null-aware) SELECT * FROM customers WHERE id NOT IN (SELECT id FROM banned); -- Result: Empty (correct - NULL in subquery makes all comparisons UNKNOWN) -- NOT EXISTS - Was incorrectly returning empty (bug) SELECT * FROM customers c WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE c.id = b.id); -- Expected: (1, 3, NULL) - NULL=NULL is FALSE, so no matches for these rows -- Actual (buggy): Empty - incorrectly using null-aware semantics ``` In `decorrelate_predicate_subquery.rs`, line 424: ```rust let null_aware = matches!(join_type, JoinType::LeftAnti); ``` This set `null_aware=true` for ALL LeftAnti joins, but it should only be true for NOT IN (InSubquery), not NOT EXISTS (Exists). The `SubqueryInfo` struct already distinguishes between them: - **NOT IN**: Created with `new_with_in_expr()` → `in_predicate_opt` is `Some(...)` - **NOT EXISTS**: Created with `new()` → `in_predicate_opt` is `None` Fixed by checking both conditions: ```rust let null_aware = matches!(join_type, JoinType::LeftAnti) && in_predicate_opt.is_some(); // Only NOT IN, not NOT EXISTS ``` **File**: `datafusion/optimizer/src/decorrelate_predicate_subquery.rs` - Updated null_aware detection to only apply to NOT IN (lines 420-426) - Added comprehensive comments explaining the distinction - Check `in_predicate_opt.is_some()` to distinguish NOT IN from NOT EXISTS **File**: `datafusion/sqllogictest/test_files/null_aware_anti_join.slt` Added 5 new test scenarios (Tests 14-18): **Test 14**: Direct comparison of NOT IN vs NOT EXISTS with NULLs - NOT IN with NULL → empty result (null-aware) - NOT EXISTS with NULL → returns non-matching rows (NOT null-aware) - EXPLAIN verification **Test 15**: NOT EXISTS with no NULLs **Test 16**: NOT EXISTS with correlated subquery **Test 17**: NOT EXISTS with all-NULL subquery - Shows that NOT EXISTS returns all rows (NULL=NULL is FALSE) - Compares with NOT IN which correctly returns empty **Test 18**: Nested NOT EXISTS and NOT IN - Verifies correct interaction between the two ```bash cargo test -p datafusion-sqllogictest --test sqllogictests -- null_aware_anti_join cargo test -p datafusion-sqllogictest --test sqllogictests subquery.slt cargo test -p datafusion-optimizer --lib cargo test -p datafusion-physical-plan --lib hash_join ``` This fix ensures DataFusion correctly implements SQL semantics: - NOT IN subqueries now correctly use null-aware anti join (three-valued logic) - NOT EXISTS subqueries now correctly use regular anti join (two-valued logic) Users can now reliably use both NOT IN and NOT EXISTS with confidence that NULL handling follows SQL standards. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Fixed compilation errors in plan.rs test code that were missing the null_aware parameter in Join::try_new() calls and direct Join struct construction. Changes: - Added null_aware: false to 7 Join::try_new() calls in test functions - Added null_aware: false to 1 direct Join struct construction All tests pass except for one pre-existing failure in expr_rewriter::order_by::test::rewrite_sort_cols_by_agg_alias which is unrelated to null-aware joins.

Fixed compilation errors in datafusion/core test files that were missing the null_aware parameter in HashJoinExec::try_new() calls. Changes: - datafusion/core/tests/execution/coop.rs: Fixed 2 instances - datafusion/core/tests/physical_optimizer/test_utils.rs: Fixed 1 instance All instances now pass null_aware=false since these are generic test utilities not specifically testing null-aware anti join functionality.

Fixed 30 HashJoinExec::try_new() calls across 5 test files that were missing the null_aware parameter (9th parameter). Changes: - datafusion/core/tests/physical_optimizer/projection_pushdown.rs: 3 calls - datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: 15 calls - datafusion/core/tests/physical_optimizer/join_selection.rs: 10 calls - datafusion/core/tests/physical_optimizer/replace_with_order_preserving_variants.rs: 1 call - datafusion/core/tests/fuzz_cases/join_fuzz.rs: 1 call All instances now pass null_aware=false as these are generic test utilities not specifically testing null-aware anti join functionality.

Fixed 3 additional HashJoinExec::try_new() calls that were missed in the previous commit. Changes: - datafusion/core/tests/execution/coop.rs: 2 calls (lines 715, 749) - datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: 1 call (line 3575) All instances now pass null_aware=false.

…behavior The test was expecting (NULL, 'e') to be returned by a NOT IN query when the subquery contains NULL values. This is incorrect according to SQL semantics. With null-aware anti join (three-valued logic), when the subquery contains ANY NULL value, the NOT IN expression evaluates to UNKNOWN for all rows, which are filtered out by the WHERE clause, resulting in an empty set. This is the correct SQL NOT IN behavior and validates that our null-aware anti join implementation is working properly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Fixed two clippy warnings: 1. doc_lazy_continuation: Added blank lines to properly separate doc comment paragraphs for the null_aware field documentation 2. too_many_arguments: Added #[expect(clippy::too_many_arguments)] attribute to Join::try_new since 8 parameters are necessary for complete join specification 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Fixed a bug where NULL rows were incorrectly filtered out when the subquery in a NOT IN clause was empty. According to SQL semantics: - NULL NOT IN (empty set) = TRUE (should return the NULL row) - NULL NOT IN (..., NULL, ...) = UNKNOWN (should NOT return the NULL row) - NULL NOT IN (2, 4) = UNKNOWN (should NOT return the NULL row) The bug was that the implementation unconditionally filtered out LEFT rows with NULL keys in null-aware anti joins, even when the probe side (subquery) was empty. The fix introduces a new flag `probe_side_non_empty` to track whether any probe batches were processed. NULL keys are now only filtered out when the probe side is non-empty, correctly implementing the SQL NOT IN semantics for empty subqueries. Changes: - Added `probe_side_non_empty` field to HashJoinStream - Set flag to true when processing probe batches - Only filter NULL keys if probe side was non-empty - Updated Test 5 to expect NULL row in result 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Test 9 demonstrates a known limitation where mark joins used for OR conditions with NOT IN subqueries don't properly implement null-aware semantics. The issue: - When a query has "NOT IN (subquery1) OR NOT IN (subquery2)", the optimizer uses RightMark joins instead of LeftAnti joins - Mark joins add a boolean column indicating matches but treat NULL keys as non-matching (FALSE) rather than UNKNOWN - This causes incorrect results: NULL rows are returned when they should be filtered out According to SQL semantics: - NULL NOT IN (values) = UNKNOWN - UNKNOWN OR UNKNOWN = UNKNOWN (filtered by WHERE) Current behavior: - NULL mark = FALSE - NOT FALSE OR NOT FALSE = TRUE (incorrectly included) The correct fix would be to implement null-aware support for mark joins, which would require the mark column to be nullable and set to NULL when join keys are NULL. This is a more complex change that should be addressed separately. For now, the test documents this limitation with detailed comments explaining the issue and marking it as a TODO. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Fixed an issue where probe_side_non_empty was being set even for empty batches (batches with 0 rows), which could cause incorrect behavior in null-aware anti joins. The bug: process_probe_batch was unconditionally setting probe_side_non_empty = true, even when the batch had 0 rows. This could lead to incorrectly filtering out NULL rows from the left side when the probe side was actually empty (just had empty batches as artifacts of streaming). The fix: Only set probe_side_non_empty = true when batch.num_rows() > 0, ensuring we only consider the probe side as non-empty when it actually contains data rows. This fixes a CI test failure in Test 10 where the subquery filtered down to non-empty results, but empty batches were being processed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Null-aware anti joins must use PartitionMode::CollectLeft instead of PartitionMode::Partitioned because they track probe-side state (probe_side_non_empty, probe_side_has_null) per-partition, but require global knowledge for correct NULL handling. The problem with partitioned mode: - Hash joins partition rows by hash(join_key) - Row with NULL key goes to partition X (hash(NULL)) - Row with value 2 goes to partition Y (hash(2)) - Partition X doesn't see any probe rows, even though probe side is globally non-empty - This causes partition X to incorrectly return NULL rows Example that failed in CI: SELECT * FROM outer_table WHERE id NOT IN (SELECT id FROM inner WHERE value = 'x'); - Subquery returns [2] - Row (NULL, 'e') from outer_table hashes to different partition than 2 - That partition sees no probe rows and incorrectly returns (NULL, 'e') The fix: - Force PartitionMode::CollectLeft for null-aware anti joins - This collects the left side (outer table) into a single partition - All partitions see the same complete probe side - Correct global state tracking for null handling Trade-off: Null-aware anti joins lose parallelism on the build side, but gain correctness. This is acceptable since null-aware anti joins are typically used for NOT IN subqueries which are less common and often involve smaller datasets. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Added an additional check in the physical planner to prevent null-aware anti joins from using PartitionMode::Auto. This ensures they use PartitionMode::CollectLeft from the start, before any optimizer passes. The issue: Even with the fix in join_selection.rs, the physical planner was creating null-aware joins with PartitionMode::Auto when target_partitions > 1 and repartition_joins is enabled (common in CI). The fix: Added `&& !*null_aware` condition to the partition mode decision in the physical planner, forcing null-aware joins to skip the Auto mode and go directly to CollectLeft. This provides defense-in-depth: 1. Physical planner: Creates with CollectLeft initially 2. Join selection optimizer: Ensures it stays CollectLeft 3. Stream execution: Has per-partition tracking as backup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

…cution The previous implementation used per-partition flags to track probe side state, which caused incorrect results when hash partitioning distributed rows across multiple partitions. With CollectLeft mode, each output partition only had local knowledge of its own probe data, not global state. This commit fixes the issue by: 1. Adding shared AtomicBool flags to JoinLeftData (probe_side_non_empty, probe_side_has_null) 2. All partitions write to and read from these shared atomic flags 3. Ensures global knowledge of probe side state across all partitions Example of the bug: - With 16 partitions, NULL rows hash to partition 5, value 2 hashes to partition 12 - Partition 5 sees no probe data (local view: empty) - Partition 12 sees probe data (local view: non-empty) - If partition 5 outputs final results, it incorrectly returns NULL rows With shared atomic state, partition 5 now sees the global truth and correctly filters NULL rows when probe side is non-empty. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

This test verifies that NOT IN with NULL in the subquery result correctly returns an empty result set. The query tests the three-valued logic semantics: Query: SELECT * FROM test_table WHERE (c1 NOT IN (SELECT c2 FROM test_table)) = true Since the subquery result contains NULL, the NOT IN predicate evaluates to UNKNOWN (not TRUE) for all rows, resulting in an empty output. Test data: - test_table: (1,1), (2,2), (3,3), (4,NULL), (NULL,0) - Subquery returns: 1, 2, 3, NULL, 0 - Expected result: empty (because NULL in subquery makes all comparisons UNKNOWN) Fixes apache#10583 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

The correlated subquery from the issue: SELECT * FROM test_table t1 WHERE c1 NOT IN (SELECT c2 FROM test_table t2 WHERE t1.c1 = t2.c1) creates a multi-column join (correlation condition + NOT IN condition), which is not yet supported in Phase 1 of null-aware anti join implementation. Phase 1 only supports single column joins. Added a note documenting this known limitation and indicating it will be addressed in next Phase (multi-column support). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

…ewrite This commit addresses two review comments: 1. Preserve null_aware flag in Join::rewrite_with_exprs_and_inputs (plan.rs L906-947): - Previously the flag was destructured with `..` but hardcoded to `false` when reconstructing - Now explicitly extracts and preserves the flag value 2. Add null_aware to HashJoinExecNode protobuf (mod.rs L1242, L2236): - Added `bool null_aware = 10;` to HashJoinExecNode message in datafusion.proto - Updated serialization to write exec.null_aware - Updated deserialization to read hashjoin.null_aware - Regenerated protobuf code with regen.sh These changes ensure null_aware flag is correctly preserved during query optimization passes and serialization/deserialization for distributed execution. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Dandandan · 2026-01-09T08:39:10Z

run benchmarks

Dandandan · 2026-01-09T08:39:15Z

run benchmark tpch

alamb-ghbot · 2026-01-09T08:39:18Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing null-aware-anti-join (5f9249b) to 1f654bb diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb-ghbot · 2026-01-09T09:20:16Z

🤖: Benchmark completed

Details

Comparing HEAD and null-aware-anti-join
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ null-aware-anti-join ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2407.82 ms │           2382.47 ms │ no change │
│ QQuery 1     │   951.86 ms │            937.77 ms │ no change │
│ QQuery 2     │  1935.36 ms │           1876.74 ms │ no change │
│ QQuery 3     │  1154.56 ms │           1139.86 ms │ no change │
│ QQuery 4     │  2317.61 ms │           2296.67 ms │ no change │
│ QQuery 5     │ 28583.61 ms │          27855.79 ms │ no change │
│ QQuery 6     │  3867.34 ms │           4037.01 ms │ no change │
│ QQuery 7     │  3704.48 ms │           3695.20 ms │ no change │
└──────────────┴─────────────┴──────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                   │ 44922.63ms │
│ Total Time (null-aware-anti-join)   │ 44221.50ms │
│ Average Time (HEAD)                 │  5615.33ms │
│ Average Time (null-aware-anti-join) │  5527.69ms │
│ Queries Faster                      │          0 │
│ Queries Slower                      │          0 │
│ Queries with No Change              │          8 │
│ Queries with Failure                │          0 │
└─────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ null-aware-anti-join ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     1.42 ms │              1.44 ms │     no change │
│ QQuery 1     │    50.40 ms │             50.03 ms │     no change │
│ QQuery 2     │   131.51 ms │            131.79 ms │     no change │
│ QQuery 3     │   152.96 ms │            156.88 ms │     no change │
│ QQuery 4     │  1069.89 ms │           1088.65 ms │     no change │
│ QQuery 5     │  1326.76 ms │           1347.65 ms │     no change │
│ QQuery 6     │     1.43 ms │              1.46 ms │     no change │
│ QQuery 7     │    53.68 ms │             54.24 ms │     no change │
│ QQuery 8     │  1418.90 ms │           1448.18 ms │     no change │
│ QQuery 9     │  1719.81 ms │           1834.14 ms │  1.07x slower │
│ QQuery 10    │   337.22 ms │            356.06 ms │  1.06x slower │
│ QQuery 11    │   388.90 ms │            404.02 ms │     no change │
│ QQuery 12    │  1219.21 ms │           1296.91 ms │  1.06x slower │
│ QQuery 13    │  1915.90 ms │           1963.98 ms │     no change │
│ QQuery 14    │  1211.42 ms │           1247.97 ms │     no change │
│ QQuery 15    │  1226.80 ms │           1250.34 ms │     no change │
│ QQuery 16    │  2536.63 ms │           2568.27 ms │     no change │
│ QQuery 17    │  2520.73 ms │           2516.25 ms │     no change │
│ QQuery 18    │  6243.97 ms │           4849.76 ms │ +1.29x faster │
│ QQuery 19    │   119.19 ms │            118.55 ms │     no change │
│ QQuery 20    │  1959.83 ms │           1903.46 ms │     no change │
│ QQuery 21    │  2189.60 ms │           2193.72 ms │     no change │
│ QQuery 22    │  7486.44 ms │           3758.40 ms │ +1.99x faster │
│ QQuery 23    │ 12179.05 ms │          12186.88 ms │     no change │
│ QQuery 24    │   212.96 ms │            209.14 ms │     no change │
│ QQuery 25    │   469.73 ms │            459.41 ms │     no change │
│ QQuery 26    │   234.08 ms │            215.01 ms │ +1.09x faster │
│ QQuery 27    │  2676.09 ms │           2731.35 ms │     no change │
│ QQuery 28    │ 24689.84 ms │          23519.98 ms │     no change │
│ QQuery 29    │   975.64 ms │            953.67 ms │     no change │
│ QQuery 30    │  1324.77 ms │           1335.41 ms │     no change │
│ QQuery 31    │  1337.86 ms │           1329.54 ms │     no change │
│ QQuery 32    │  5445.94 ms │           5153.11 ms │ +1.06x faster │
│ QQuery 33    │  5870.02 ms │           5671.76 ms │     no change │
│ QQuery 34    │  5928.22 ms │           6190.21 ms │     no change │
│ QQuery 35    │  1915.92 ms │           1936.95 ms │     no change │
│ QQuery 36    │    64.18 ms │             66.90 ms │     no change │
│ QQuery 37    │    44.92 ms │             44.43 ms │     no change │
│ QQuery 38    │    64.22 ms │             67.25 ms │     no change │
│ QQuery 39    │   100.95 ms │            104.57 ms │     no change │
│ QQuery 40    │    27.05 ms │             25.89 ms │     no change │
│ QQuery 41    │    23.59 ms │             22.28 ms │ +1.06x faster │
│ QQuery 42    │    19.09 ms │             18.93 ms │     no change │
└──────────────┴─────────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                   │ 98886.75ms │
│ Total Time (null-aware-anti-join)   │ 92784.82ms │
│ Average Time (HEAD)                 │  2299.69ms │
│ Average Time (null-aware-anti-join) │  2157.79ms │
│ Queries Faster                      │          5 │
│ Queries Slower                      │          3 │
│ Queries with No Change              │         35 │
│ Queries with Failure                │          0 │
└─────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ null-aware-anti-join ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 115.95 ms │            116.28 ms │     no change │
│ QQuery 2     │  26.77 ms │             29.69 ms │  1.11x slower │
│ QQuery 3     │  37.60 ms │             36.47 ms │     no change │
│ QQuery 4     │  27.88 ms │             28.94 ms │     no change │
│ QQuery 5     │  84.56 ms │             85.78 ms │     no change │
│ QQuery 6     │  19.92 ms │             19.66 ms │     no change │
│ QQuery 7     │ 235.55 ms │            223.49 ms │ +1.05x faster │
│ QQuery 8     │  31.59 ms │             35.00 ms │  1.11x slower │
│ QQuery 9     │ 102.25 ms │            105.29 ms │     no change │
│ QQuery 10    │  61.08 ms │             61.63 ms │     no change │
│ QQuery 11    │  16.24 ms │             18.73 ms │  1.15x slower │
│ QQuery 12    │  50.27 ms │             49.85 ms │     no change │
│ QQuery 13    │  47.29 ms │             46.31 ms │     no change │
│ QQuery 14    │  13.17 ms │             13.20 ms │     no change │
│ QQuery 15    │  23.94 ms │             23.76 ms │     no change │
│ QQuery 16    │  24.11 ms │             38.34 ms │  1.59x slower │
│ QQuery 17    │ 148.51 ms │            150.17 ms │     no change │
│ QQuery 18    │ 270.82 ms │            270.29 ms │     no change │
│ QQuery 19    │  38.34 ms │             36.61 ms │     no change │
│ QQuery 20    │  48.46 ms │             49.03 ms │     no change │
│ QQuery 21    │ 307.70 ms │            309.99 ms │     no change │
│ QQuery 22    │  17.40 ms │             17.00 ms │     no change │
└──────────────┴───────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                   │ 1749.40ms │
│ Total Time (null-aware-anti-join)   │ 1765.52ms │
│ Average Time (HEAD)                 │   79.52ms │
│ Average Time (null-aware-anti-join) │   80.25ms │
│ Queries Faster                      │         1 │
│ Queries Slower                      │         4 │
│ Queries with No Change              │        17 │
│ Queries with Failure                │         0 │
└─────────────────────────────────────┴───────────┘

alamb-ghbot · 2026-01-09T09:20:21Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing null-aware-anti-join (5f9249b) to 1f654bb diff using: tpch
Results will be posted here when complete

alamb-ghbot · 2026-01-09T09:20:59Z

🤖: Benchmark completed

Details

Comparing HEAD and null-aware-anti-join
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ null-aware-anti-join ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 191.63 ms │            190.78 ms │    no change │
│ QQuery 2     │  93.96 ms │             95.03 ms │    no change │
│ QQuery 3     │ 127.58 ms │            121.76 ms │    no change │
│ QQuery 4     │  77.04 ms │             77.06 ms │    no change │
│ QQuery 5     │ 168.21 ms │            167.32 ms │    no change │
│ QQuery 6     │  67.04 ms │             65.98 ms │    no change │
│ QQuery 7     │ 211.05 ms │            216.05 ms │    no change │
│ QQuery 8     │ 158.59 ms │            163.92 ms │    no change │
│ QQuery 9     │ 224.12 ms │            224.05 ms │    no change │
│ QQuery 10    │ 183.98 ms │            182.04 ms │    no change │
│ QQuery 11    │  75.15 ms │             74.17 ms │    no change │
│ QQuery 12    │ 114.14 ms │            115.47 ms │    no change │
│ QQuery 13    │ 213.42 ms │            206.17 ms │    no change │
│ QQuery 14    │  88.63 ms │             95.27 ms │ 1.07x slower │
│ QQuery 15    │ 119.94 ms │            121.97 ms │    no change │
│ QQuery 16    │  54.94 ms │             62.49 ms │ 1.14x slower │
│ QQuery 17    │ 271.31 ms │            277.28 ms │    no change │
│ QQuery 18    │ 305.72 ms │            311.79 ms │    no change │
│ QQuery 19    │ 134.29 ms │            133.18 ms │    no change │
│ QQuery 20    │ 126.47 ms │            122.30 ms │    no change │
│ QQuery 21    │ 257.13 ms │            258.00 ms │    no change │
│ QQuery 22    │  40.41 ms │             45.26 ms │ 1.12x slower │
└──────────────┴───────────┴──────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                   │ 3304.76ms │
│ Total Time (null-aware-anti-join)   │ 3327.35ms │
│ Average Time (HEAD)                 │  150.22ms │
│ Average Time (null-aware-anti-join) │  151.24ms │
│ Queries Faster                      │         0 │
│ Queries Slower                      │         3 │
│ Queries with No Change              │        19 │
│ Queries with Failure                │         0 │
└─────────────────────────────────────┴───────────┘

Dandandan · 2026-01-09T09:22:05Z

│ QQuery 16 │ 24.11 ms │ 38.34 ms │ 1.59x slower │

hmmm...

Dandandan · 2026-01-09T09:28:28Z

datafusion/physical-optimizer/src/join_selection.rs

+        // (probe_side_non_empty, probe_side_has_null) per-partition, but need global knowledge
+        // for correct null handling. With partitioning, a partition might not see probe rows
+        // even if the probe side is globally non-empty, leading to incorrect NULL row handling.
+        let partition_mode = if hash_join.null_aware {


Can we avoid CollectLeft as fallback if the keys are not nullable or is this done already?

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) proto Related to proto crate physical-plan Changes to the physical-plan crate labels Jan 4, 2026

viirya mentioned this pull request Jan 4, 2026

DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583

Open

viirya commented Jan 4, 2026

View reviewed changes

viirya changed the title ~~Null-aware LeftAnti Join~~ feat: Add null-aware anti join support Jan 4, 2026

viirya added the bug Something isn't working label Jan 4, 2026

martin-g reviewed Jan 6, 2026

View reviewed changes

datafusion/expr/src/logical_plan/plan.rs Outdated Show resolved Hide resolved

datafusion/proto/src/physical_plan/mod.rs Outdated Show resolved Hide resolved

datafusion/sqllogictest/test_files/null_aware_anti_join.slt Show resolved Hide resolved

martin-g reviewed Jan 6, 2026

View reviewed changes

datafusion/physical-plan/src/joins/hash_join/stream.rs Outdated Show resolved Hide resolved

comphead reviewed Jan 6, 2026

View reviewed changes

datafusion/sqllogictest/test_files/null_aware_anti_join.slt Show resolved Hide resolved

viirya force-pushed the null-aware-anti-join branch from 9523d19 to cdff5e2 Compare January 7, 2026 03:31

viirya and others added 13 commits January 6, 2026 20:13

fix format

deadd4a

fix format

335bde9

viirya force-pushed the null-aware-anti-join branch from cdff5e2 to 335bde9 Compare January 7, 2026 04:13

viirya and others added 4 commits January 7, 2026 11:29

viirya force-pushed the null-aware-anti-join branch from f5514c4 to dadc47c Compare January 7, 2026 19:29

viirya and others added 5 commits January 7, 2026 11:57

fix format

4f454de

update tpch q16 query plan

5f9249b

Dandandan reviewed Jan 9, 2026

View reviewed changes

+b
+c
+d
+              NULL e

feat: Add null-aware anti join support #19635

Are you sure you want to change the base?

feat: Add null-aware anti join support #19635

Uh oh!

Conversation

viirya commented Jan 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

viirya Jan 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

comphead commented Jan 4, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

comphead Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

viirya Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

viirya Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

viirya Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Dandandan commented Jan 9, 2026

Uh oh!

Dandandan commented Jan 9, 2026

Uh oh!

alamb-ghbot commented Jan 9, 2026

Uh oh!

alamb-ghbot commented Jan 9, 2026

Uh oh!

alamb-ghbot commented Jan 9, 2026

Uh oh!

alamb-ghbot commented Jan 9, 2026

Uh oh!

Dandandan commented Jan 9, 2026

Uh oh!

Dandandan Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

viirya commented Jan 4, 2026 •

edited

Loading

viirya Jan 4, 2026 •

edited

Loading

viirya Jan 7, 2026 •

edited

Loading