-
Notifications
You must be signed in to change notification settings - Fork 17
Add test comparing distributed + single node execution on TPCH data #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| assert_eq!( | ||
| batches1.len(), | ||
| batches2.len(), | ||
| "Query {} returned different number of batches: {} vs {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine if the are a different number of batches in distributed mode vs normal mode. What we ultimately care is that the results are the same. For example, it should be fine if normal mode returns 2 batches with 5 rows each and the distributed mode returns 5 batches with 2 rows each, as long as the data is the same.
DataFusion has this pretty_format_batches that will build a string out of the results that is human inspectable, maybe comparing those instead?
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 This is awesome! left some minor comments, but the approach looks perfect to me
c073028 to
4fab927
Compare
NGA-TRAN
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. Thanks @jayshrivastava .
Do you plan to see how DF support DDL and run q15?
| #[tokio::test] | ||
| #[ignore] | ||
| // TODO: Support query 15? | ||
| // Skip because it contains DDL statements not supported in single SQL execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine to skip this for now.
I am just curious how Vanilla DF test this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed #87 to track this. I will look into how to support views
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 awesome!
This PR also included the TPCH queries #73. I think it's fine to have them duplicated for now, but if you find an opportunity to unify them and place them somewhere you think it makes sense that would be awesome.
A batch of PRs that moved some files around was just merged, but solving the conflicts should be trivial: the tests/common/tpch.rs module was moved to src/test_utils/tpch.rs pretty much as-is, you probably want to move your changes to the tpch.rs file there.
4fab927 to
7f7e445
Compare
* add TPCH queries from DataFusion repo (see tests/fixtures/tpch/queries) * add lazy TPCH parquet data generator using github.com/clflushopt/tpchgen-rs * write consistency test that runs queries twice (distributed and single-node) and asserts match
7f7e445 to
3881319
Compare
add TPCH queries from DataFusion repo (see tests/fixtures/tpch/queries)
add lazy TPCH parquet data generator using github.com/clflushopt/tpchgen-rs
write consistency test that runs queries twice (distributed and single-node) and asserts match