Skip to content

Conversation

@jayshrivastava
Copy link
Collaborator

  • add TPCH queries from DataFusion repo (see tests/fixtures/tpch/queries)

  • add lazy TPCH parquet data generator using github.com/clflushopt/tpchgen-rs

  • write consistency test that runs queries twice (distributed and single-node) and asserts match

@jayshrivastava jayshrivastava marked this pull request as ready for review August 8, 2025 04:39
Comment on lines 91 to 94
assert_eq!(
batches1.len(),
batches2.len(),
"Query {} returned different number of batches: {} vs {}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine if the are a different number of batches in distributed mode vs normal mode. What we ultimately care is that the results are the same. For example, it should be fine if normal mode returns 2 batches with 5 rows each and the distributed mode returns 5 batches with 2 rows each, as long as the data is the same.

DataFusion has this pretty_format_batches that will build a string out of the results that is human inspectable, maybe comparing those instead?

Copy link
Collaborator

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 This is awesome! left some minor comments, but the approach looks perfect to me

Copy link
Collaborator

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Thanks @jayshrivastava .
Do you plan to see how DF support DDL and run q15?

#[tokio::test]
#[ignore]
// TODO: Support query 15?
// Skip because it contains DDL statements not supported in single SQL execution
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine to skip this for now.
I am just curious how Vanilla DF test this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #87 to track this. I will look into how to support views

Copy link
Collaborator

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 awesome!

This PR also included the TPCH queries #73. I think it's fine to have them duplicated for now, but if you find an opportunity to unify them and place them somewhere you think it makes sense that would be awesome.

A batch of PRs that moved some files around was just merged, but solving the conflicts should be trivial: the tests/common/tpch.rs module was moved to src/test_utils/tpch.rs pretty much as-is, you probably want to move your changes to the tpch.rs file there.

* add TPCH queries from DataFusion repo (see tests/fixtures/tpch/queries)

* add lazy TPCH parquet data generator using github.com/clflushopt/tpchgen-rs

* write consistency test that runs queries twice (distributed and single-node)
  and asserts match
@jayshrivastava jayshrivastava merged commit 92c5a0e into main Aug 8, 2025
3 checks passed
@jayshrivastava jayshrivastava deleted the js/consistency-test branch August 8, 2025 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants