Skip to content

Conversation

goldmedal
Copy link
Contributor

@goldmedal goldmedal commented Mar 20, 2025

Which issue does this PR close?

Rationale for this change

When working on unparsing the plan optimized by ScalarSubqueryToJoin, I noticed it could generate a LEFT JOIN without the join condition.

Projection: customer.c_custkey [c_custkey:Int64]
  Filter: customer.c_custkey = __scalar_sq_1.max(orders.o_custkey) [c_custkey:Int64, c_name:Utf8, max(orders.o_custkey):Int64;N]
    Left Join:  [c_custkey:Int64, c_name:Utf8, max(orders.o_custkey):Int64;N]
      TableScan: customer [c_custkey:Int64, c_name:Utf8]
      SubqueryAlias: __scalar_sq_1 [max(orders.o_custkey):Int64;N]
        Projection: max(orders.o_custkey) [max(orders.o_custkey):Int64;N]
          Aggregate: groupBy=[[]], aggr=[[max(orders.o_custkey)]] [max(orders.o_custkey):Int64;N]
            TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]

The unparsing result will look like

SELECT 
  customer.c_custkey 
FROM 
  customer 
  LEFT JOIN (
    SELECT 
      max(orders.o_custkey) 
    FROM 
      orders
  ) AS __scalar_sq_1
WHERE 
  (
    customer.c_custkey = __scalar_sq_1."max(orders.o_custkey)"
  )

Most databases don't allow the join without the condition besides the cross join.

DuckDB

D select * from c1 left join c2;
Parser Error:
syntax error at or near ";"

Postgres

test=# select * from t1 left join t2;
ERROR:  syntax error at or near ";"
LINE 1: select * from t1 left join t2;

DataFusion

> select * from t1 left join t2;
+----+----+
| c1 | c1 |
+----+----+
| 1  | 1  |
+----+----+
1 row(s) fetched. 
Elapsed 0.005 seconds.

I prefer to align this behavior with others. It makes the logical plan created by DataFusion reliable.
#13486 also mentioned that join without condition isn't a valid SQL.

Because the cross-join has the same IR as the inner join without the condition, #13486 should be prevented from the SQL layer. Maybe we can consider to reopen apache/datafusion-sqlparser-rs#1552

What changes are included in this PR?

  • Enforce JOIN plan to require conditions when built
  • ScalarSubqueryToJoin will generate the LEFT JOIN plan with the condition True.

Are these changes tested?

yes

Are there any user-facing changes?

Besides INNER JOIN, the join syntax without condition is not allowed anymore.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 20, 2025
}

#[test]
fn limit_should_push_down_join_without_condition() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case has never occurred now.

@goldmedal goldmedal marked this pull request as ready for review March 20, 2025 15:32
@goldmedal goldmedal marked this pull request as draft March 20, 2025 16:15
@goldmedal goldmedal marked this pull request as ready for review March 20, 2025 16:33
@github-actions github-actions bot added the sql SQL Planner label Mar 20, 2025
@goldmedal goldmedal marked this pull request as draft March 22, 2025 06:19
@goldmedal goldmedal marked this pull request as ready for review March 22, 2025 07:39
08)--------------LeftAnti Join: customer.c_custkey = __correlated_sq_1.o_custkey
09)----------------Filter: substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])
10)------------------TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])]
10)------------------TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]), Boolean(true)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I benchmarked to ensure the plan change won't impact the performance.

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃    main ┃ fix_join-check ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 68.97ms │        70.14ms │    no change │
│ QQuery 2     │ 18.72ms │        18.55ms │    no change │
│ QQuery 3     │ 30.47ms │        29.46ms │    no change │
│ QQuery 4     │ 21.45ms │        21.77ms │    no change │
│ QQuery 5     │ 43.53ms │        41.56ms │    no change │
│ QQuery 6     │ 14.39ms │        14.88ms │    no change │
│ QQuery 7     │ 55.03ms │        55.30ms │    no change │
│ QQuery 8     │ 41.24ms │        39.45ms │    no change │
│ QQuery 9     │ 50.84ms │        53.19ms │    no change │
│ QQuery 10    │ 44.85ms │        44.83ms │    no change │
│ QQuery 11    │ 13.42ms │        13.22ms │    no change │
│ QQuery 12    │ 28.49ms │        28.20ms │    no change │
│ QQuery 13    │ 29.56ms │        29.38ms │    no change │
│ QQuery 14    │ 23.03ms │        24.40ms │ 1.06x slower │
│ QQuery 15    │ 33.59ms │        34.98ms │    no change │
│ QQuery 16    │ 13.07ms │        13.08ms │    no change │
│ QQuery 17    │ 58.00ms │        58.67ms │    no change │
│ QQuery 18    │ 76.00ms │        75.06ms │    no change │
│ QQuery 19    │ 39.99ms │        38.40ms │    no change │
│ QQuery 20    │ 29.04ms │        29.79ms │    no change │
│ QQuery 21    │ 62.14ms │        64.14ms │    no change │
│ QQuery 22    │ 13.82ms │        14.01ms │    no change │
└──────────────┴─────────┴────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary             ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (main)             │ 809.65ms │
│ Total Time (fix_join-check)   │ 812.45ms │
│ Average Time (main)           │  36.80ms │
│ Average Time (fix_join-check) │  36.93ms │
│ Queries Faster                │        0 │
│ Queries Slower                │        1 │
│ Queries with No Change        │       21 │
└───────────────────────────────┴──────────┘


query IIII
select * from testSubQueryLimit as t1 join (select * from testSubQueryLimit limit 1) limit 10;
select * from testSubQueryLimit as t1, (select * from testSubQueryLimit limit 1) limit 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that It was an invalid SQL mentioned by #13486. But indeed, we can modify it after #13486 is fixed. I revert this change here.


# join condition is required
query error join condition should not be empty
SELECT * FROM t1 FULL JOIN t2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include also a cross join without filters like

SELECT * FROM t1 CROSS JOIN t2

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @goldmedal

@alamb alamb merged commit 3e56ed2 into apache:main Mar 25, 2025
27 checks passed
@goldmedal goldmedal deleted the fix/join-check branch March 26, 2025 00:52
@goldmedal
Copy link
Contributor Author

Thanks @comphead and @alamb 👍

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
* add check for missing join condition

* modify the tests

* handle the cross join case

* remove invald sql case

* fix sqllogictests

* fix for cross join case

* revert the change for limit.slt

* add cross join test
@linhr linhr mentioned this pull request Apr 8, 2025
39 tasks
@niebayes
Copy link
Contributor

niebayes commented Apr 21, 2025

@goldmedal I don't understand this change.
Previously, we have a sql:

select * from t
where (select sid from t) = (select a from t limit 1)
order by ts

which actually corresponds to the following:

select * from t where sid = some_value order by ts

The generated plan has a Left Join with a Filter: Boolean(true). I think this sql should not be impacted by this change.

@goldmedal
Copy link
Contributor Author

@goldmedal I don't understand this change. Previously, we have a sql:

select * from t
where (select sid from t) = (select a from t limit 1)
order by ts

The ScalarSubquery will be transformed by ScalarSubqueryToJoin to a left join, and then this PR forces the left join to have a condition. So, in this PR, I added the True expression for the left join in ScalarSubqueryToJoin.

I think the result is expected.

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* add check for missing join condition

* modify the tests

* handle the cross join case

* remove invald sql case

* fix sqllogictests

* fix for cross join case

* revert the change for limit.slt

* add cross join test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants