feat(rust/sedona-spatial-join): Add config to disable spatial join reordering#733
feat(rust/sedona-spatial-join): Add config to disable spatial join reordering#7332010YOUY01 wants to merge 1 commit intoapache:mainfrom
Conversation
| /// 3. Do not swap the join order if join reordering is disabled or no relevant | ||
| /// statistics are available. | ||
| fn should_swap_join_order( | ||
| spatial_join_options: &SpatialJoinOptions, |
There was a problem hiding this comment.
Use SpatialJoinOptions arg instead of a boolean flag, since this function will likely require additional options in the future and this approach is more extensible.
| return file_names[:2] | ||
|
|
||
|
|
||
| def test_spatial_join_reordering_can_be_disabled_e2e(geoarrow_data): |
There was a problem hiding this comment.
One consideration is the test’s running time: the two Parquet files are around 10 KB, and this test runs in about 0.1 s on my machine.
I couldn’t find another way to perform the e2e test. The sd_random_geometry() table function does not seem to produce statistics, so spatial join reordering cannot occur.
Hope this is okay.
There was a problem hiding this comment.
Definitely ok! (Parquet tests use some of the larger files to ensure everything works with realistic input)
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you!
This is all good from my end, although I usually think of "join ordering" as involving multiple joins (this is more like choosing the indexed side of a single join). I am also not sure I have a better idea of what to call the option 🙂
| assert path_left.exists(), f"Missing test asset: {path_left}" | ||
| assert path_right.exists(), f"Missing test asset: {path_right}" |
There was a problem hiding this comment.
I believe we have sedonadb.testing.skip_if_not_exists() for this (so the tests can run without the submodule)
| return file_names[:2] | ||
|
|
||
|
|
||
| def test_spatial_join_reordering_can_be_disabled_e2e(geoarrow_data): |
There was a problem hiding this comment.
Definitely ok! (Parquet tests use some of the larger files to ensure everything works with realistic input)
| def _plan_text(df): | ||
| query_plan = df.to_pandas() | ||
| return "\n".join(query_plan.iloc[:, 1].astype(str).tolist()) | ||
|
|
||
|
|
||
| def _spatial_join_side_file_names(plan_text): |
There was a problem hiding this comment.
I opened #734 so that we can make hopefully make this easier some day.
Motivation
Heuristics-based join reordering can fail. Providing an option to disable it allows manual control over spatial join order—the execution order will match the query order.
This configuration only affects spatial joins, not regular joins, for greater flexibility.
Demo
In sedona-cli