Commit 82af167

viirya and claude committed
Optimize null-aware anti join to skip CollectLeft when keys are non-nullable
When all join keys are non-nullable on both sides, we don't need null-aware semantics because NULLs cannot exist in the data. This allows the query to use regular Partitioned mode instead of the more expensive CollectLeft mode.

Implementation:
- Added join_keys_may_be_null() helper function that checks schema nullability
- Modified null_aware flag logic to only enable when:
  1. It's a NOT IN subquery (not NOT EXISTS)
  2. AND at least one join key column is nullable

Benefits:
- Queries with NOT NULL constraints can use Partitioned mode (better parallelism)
- Avoids unnecessary CollectLeft overhead when null-aware semantics aren't needed
- Regular anti join is cheaper than null-aware (no atomic flag synchronization)

Example:
SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)
- If t1.id and t2.id are NOT NULL: uses regular anti join with Partitioned mode
- If either is nullable: uses null-aware anti join with CollectLeft mode

Addresses review comment on join_selection.rs L251 by detecting nullability earlier in the optimizer rather than in the physical optimizer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
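To illustrate why the nullability of the keys matters, here is a minimal sketch of SQL's three-valued `NOT IN` next to plain anti-join matching. It is not DataFusion code: `sql_not_in`, `where_keeps`, and `plain_anti_join_keeps` are hypothetical helpers, `None` stands in for NULL, and the anti join is assumed to treat NULL as equal to nothing.

```rust
/// SQL three-valued `x NOT IN (values)`, with `None` standing in for NULL.
/// Returns `None` (UNKNOWN) when the answer depends on a NULL.
fn sql_not_in(x: Option<i64>, values: &[Option<i64>]) -> Option<bool> {
    // An empty list makes NOT IN true, even for a NULL probe value.
    if values.is_empty() {
        return Some(true);
    }
    // A NULL probe compared against a non-empty list is UNKNOWN.
    let x = x?;
    // A definite match makes NOT IN false.
    if values.iter().any(|v| *v == Some(x)) {
        return Some(false);
    }
    // No match, but a NULL in the list leaves the result UNKNOWN.
    if values.iter().any(|v| v.is_none()) {
        return None;
    }
    Some(true)
}

/// A WHERE clause keeps a row only when the predicate is definitely true.
fn where_keeps(predicate: Option<bool>) -> bool {
    predicate == Some(true)
}

/// Plain anti-join semantics (assumed): keep the row when no equal key exists.
/// NULL never equals anything, so a NULL on either side never matches.
fn plain_anti_join_keeps(x: Option<i64>, values: &[Option<i64>]) -> bool {
    match x {
        None => true,
        Some(x) => !values.iter().any(|v| *v == Some(x)),
    }
}
```

With a NULL on either side the two disagree: `where_keeps(sql_not_in(Some(1), &[Some(2), None]))` drops the row, while `plain_anti_join_keeps(Some(1), &[Some(2), None])` keeps it. When neither side can hold NULL, both functions return the same answer for every input, which is the case this commit targets.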
1 parent 5f9249b commit 82af167

1 file changed: +40 -3 lines changed

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

Lines changed: 40 additions & 3 deletions
@@ -27,7 +27,7 @@ use crate::{OptimizerConfig, OptimizerRule};
 
 use datafusion_common::alias::AliasGenerator;
 use datafusion_common::tree_node::{Transformed, TransformedResult, TreeNode};
-use datafusion_common::{Column, NullEquality, Result, assert_or_internal_err, plan_err};
+use datafusion_common::{Column, DFSchemaRef, ExprSchema, NullEquality, Result, assert_or_internal_err, plan_err};
 use datafusion_expr::expr::{Exists, InSubquery};
 use datafusion_expr::expr_rewriter::create_col_from_scalar_expr;
 use datafusion_expr::logical_plan::{JoinType, Subquery};
@@ -310,6 +310,39 @@ fn mark_join(
     )
 }
 
+/// Check if join keys in the join filter may contain NULL values
+///
+/// Returns true if any join key column is nullable on either side.
+/// This is used to optimize null-aware anti joins: if all join keys are non-nullable,
+/// we can use a regular anti join instead of the more expensive null-aware variant.
+fn join_keys_may_be_null(
+    join_filter: &Expr,
+    left_schema: &DFSchemaRef,
+    right_schema: &DFSchemaRef,
+) -> Result<bool> {
+    // Extract columns from the join filter
+    let mut columns = std::collections::HashSet::new();
+    expr_to_columns(join_filter, &mut columns)?;
+
+    // Check if any column is nullable
+    for col in columns {
+        // Check in left schema
+        if let Ok(field) = left_schema.field_from_column(&col) {
+            if field.as_ref().is_nullable() {
+                return Ok(true);
+            }
+        }
+        // Check in right schema
+        if let Ok(field) = right_schema.field_from_column(&col) {
+            if field.as_ref().is_nullable() {
+                return Ok(true);
+            }
+        }
+    }
+
+    Ok(false)
+}
+
 fn build_join(
     left: &LogicalPlan,
     subquery: &LogicalPlan,
@@ -422,8 +455,12 @@ fn build_join(
     // - NOT IN: Uses three-valued logic, requires null-aware handling
     // - NOT EXISTS: Uses two-valued logic, regular anti join is correct
     // We can distinguish them: NOT IN has in_predicate_opt, NOT EXISTS does not
-    let null_aware =
-        matches!(join_type, JoinType::LeftAnti) && in_predicate_opt.is_some();
+    //
+    // Additionally, if the join keys are non-nullable on both sides, we don't need
+    // null-aware semantics because NULLs cannot exist in the data.
+    let null_aware = matches!(join_type, JoinType::LeftAnti)
+        && in_predicate_opt.is_some()
+        && join_keys_may_be_null(&join_filter, left.schema(), sub_query_alias.schema())?;
 
     // join our sub query into the main plan
     let new_plan = if null_aware {
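
The decision made by the new `null_aware` expression can be sketched on simplified types. This is only an illustration under stated assumptions: `Field`, `is_nullable_in`, `AntiJoinKind`, and `choose_anti_join` are hypothetical stand-ins, not DataFusion APIs, and this `join_keys_may_be_null` takes plain column names rather than an `Expr` and `DFSchemaRef`s.

```rust
/// Hypothetical, simplified stand-in for a schema field.
struct Field {
    name: &'static str,
    nullable: bool,
}

/// True if `column` exists in `schema` and is declared nullable.
/// A column missing from the schema is simply skipped, as in the diff above.
fn is_nullable_in(schema: &[Field], column: &str) -> bool {
    schema.iter().any(|f| f.name == column && f.nullable)
}

/// Same shape as the helper in the diff, on simplified types:
/// true if any referenced column is nullable on either side.
fn join_keys_may_be_null(columns: &[&str], left: &[Field], right: &[Field]) -> bool {
    columns
        .iter()
        .any(|c| is_nullable_in(left, c) || is_nullable_in(right, c))
}

#[derive(Debug, PartialEq)]
enum AntiJoinKind {
    Regular,
    NullAware,
}

/// Null-aware handling is chosen only when all three conditions hold.
fn choose_anti_join(is_left_anti: bool, is_not_in: bool, keys_may_be_null: bool) -> AntiJoinKind {
    if is_left_anti && is_not_in && keys_may_be_null {
        AntiJoinKind::NullAware
    } else {
        AntiJoinKind::Regular
    }
}
```

For the example query in the commit message, two `NOT NULL` `id` columns give `AntiJoinKind::Regular`, and making either one nullable gives `AntiJoinKind::NullAware`. A `NOT EXISTS` subquery (`is_not_in == false`) stays `Regular` regardless of nullability.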
