Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 3, 2025

Which issue does this PR close?

Rationale for this change

While working on the arrow 57 upgrade,

I had to change quite few tests in enforce_sorting and the fact they are
not yet ported to insta means it was very painful to do.

What changes are included in this PR?

rewrite enforce_sorting so it uses insta for plan snapshots rather than hard coded vec<&str>

Also avoid the use of a macro for easier maintenance (and hopefully faster compile time)

Are these changes tested?

Yes, by existing tests

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 3, 2025
let join = hash_join_exec(left_input, right_input, on, None, &JoinType::Inner)?;
let physical_plan = sort_exec([sort_expr("a", &join.schema())].into(), join);

let test = EnforceSortingTest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see that the plans did not change by turning on whitespace blind diff: https://github.com/apache/datafusion/pull/17900/files?w=1

let expected_optimized: Vec<&str> =optimized_plan_string.trim().lines().collect();

// return a string with both input and optimized plan
format!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This somewhat crazy formatting is designed to make it easier to review the diff on github

@alamb alamb force-pushed the alamb/enforce_sorting_insta branch from 9311077 to ecfc0f8 Compare October 3, 2025 14:59
@alamb alamb force-pushed the alamb/enforce_sorting_insta branch from ecfc0f8 to 1cac670 Compare October 3, 2025 15:17
@alamb alamb changed the title Refactor: Update enforce_sorting tests to use insta snapshots for easier updates Refactor: Update enforce_sorting tests to use insta snapshots for easier updates (part 1) Oct 3, 2025
@alamb alamb marked this pull request as ready for review October 3, 2025 15:26
@blaginin blaginin self-requested a review October 3, 2025 15:36

/// Runs the enforce sorting test and returns a string with the input and
/// optimized plan as strings for snapshot comparison using insta
fn run(&self) -> String {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this mimics / replicates the logic of assert_optimized_plan!

I was not able to remove that macro yet because there are several parameterized plans that use it that I have to migrate somehow separately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed!

@blaginin
Copy link
Contributor

blaginin commented Oct 4, 2025

Since I was also working on that, I'll push a few commits to this branch to speed up the merge. Feel free to drop them if they don't fit

// should not add a sort at the output of the union, input plan should not be changed
assert_optimized!(expected_input, expected_input, physical_plan, true);
Input / Optimized Plan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if those two are the same, i think there's no need to duplicate the plans - so i assert just one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also i removed expect_no_change, because it was duplicated and actually didn't catch some cases (see 0fd01c7)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very clever -- thank you

Comment on lines +333 to +337
Input Plan:
SortExec: expr=[a@2 ASC], preserve_partitioning=[false]
HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_a@0, c@2)]
DataSourceExec: partitions=1, partition_sizes=[0]
DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], output_ordering=[a@0 ASC], file_type=parquet
Copy link
Contributor

@blaginin blaginin Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is definitely just a personal preference but I find merged strings easier to follow than array representation with [] and "".

Before it would make sense, because it'd allow you to copy and paste the plan - but now this is managed by insta so no need use debug representation in my opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I was using the debug representation to try and make review easier (to see that the plans hadn't changed)

blaginin and others added 5 commits October 4, 2025 19:18
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Split the rstest parameterized test into two separate functions:
- test_with_lost_ordering_unbounded
- test_with_lost_ordering_bounded

Each tests both repartition_sorts modes (false and true) using insta snapshots.
Replaced the incomplete existing test_with_lost_ordering_bounded which only
tested one mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@@ -2525,1213 +2696,1255 @@ async fn test_window_partial_constant_and_set_monotonicity() -> Result<()> {
"avg".to_string(),
function_arg_unordered,
);
struct TestCase<'a> {
Copy link
Contributor

@blaginin blaginin Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this test I asked claude to write a script that confirms before and after are the same:
https://gist.github.com/blaginin/606a2b6d9ba48627d0bdf5b9aa6d1c57

i manually changed something in "after" and it have caught it. also i randomly verified some cases

Comment on lines 576 to 578
.with_expected_description(
"// Should adjust the requirement in the third input of the union so\n\
// that it is not unnecessarily fine.",
Copy link
Contributor

@blaginin blaginin Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore, but I think it's not the best idea to mix comments and snapshots - it'll now be repeated twice in the code, making it messier, and it won't be clear for newcomers which one to update

I personally don’t see an issue with just adding the comment after the assert. I did that separately in: blaginin@ecdf6c3 in case you’ll like it🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Applied

Copy link
Contributor

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!! wrote some comments, and you may need to check the changes i pushed here 😅

@blaginin blaginin changed the title Refactor: Update enforce_sorting tests to use insta snapshots for easier updates (part 1) Refactor: Update enforce_sorting tests to use insta snapshots for easier updates Oct 4, 2025
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blaginin -- I think this test is now much easier to deal with ❤️

// should not add a sort at the output of the union, input plan should not be changed
assert_optimized!(expected_input, expected_input, physical_plan, true);
Input / Optimized Plan:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very clever -- thank you

Comment on lines 576 to 578
.with_expected_description(
"// Should adjust the requirement in the third input of the union so\n\
// that it is not unnecessarily fine.",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Applied

@@ -2480,6 +2547,73 @@ async fn test_window_partial_constant_and_set_monotonicity() -> Result<()> {
.into();
let source = parquet_exec_with_sort(input_schema.clone(), vec![ordering]) as _;

// Macro for testing window function optimization with snapshots
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor Author

alamb commented Oct 6, 2025

Thank you very much for helping to push this along @blaginin 🙏

@alamb alamb added this pull request to the merge queue Oct 6, 2025
Merged via the queue into apache:main with commit e03ba61 Oct 6, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants