Simplify TPC-H Validation #64

@NGA-TRAN

Description

Our TPC-H scale 1 validation currently depends on Python, datafusion-cli, and the TPC-H data generator, which makes it unsuitable for CI. With the newly added infrastructure, we can now streamline this test and integrate it into CI. We should modify the test so that it can run in CI.

Other changes for this validation:

  1. We do not need to run datafusion-cli to get the single-node result either. We can run the queries directly from a SessionContext. See/use the execute_sql_single_node function in the PR above.
  2. I do not think we need a lot of data (scale 1) to validate the results either. I suspect we can generate scale 0.01 (or smaller/larger), which is large enough for the validation but small enough to check the data files in, so we avoid regenerating the data every time the test runs (in CI). These files can replace the tpch_small files in tpch/data/ and serve the different test purposes. If the tpch-generate CLI cannot generate a scale < 1, we can also write a script that reduces the scale 1 data files while still keeping the data needed to return meaningful results for all 22 queries (I did this with one file before and am happy to show how to do this).
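
To make the fallback in item 2 concrete, here is a minimal sketch of how a reduction script could keep the sampled data consistent. It is an illustration only, not the script mentioned above: it assumes pipe-delimited `.tbl` files in which the first column of both `orders` and `lineitem` is the order key, and the function names `sample_orders` and `filter_lineitems` are hypothetical.

```rust
use std::collections::HashSet;

/// Keep the first `max_orders` non-empty rows of an orders-style input and
/// return them together with the set of their order keys.
/// Assumption: the order key is the first pipe-delimited field.
fn sample_orders(orders: &str, max_orders: usize) -> (Vec<String>, HashSet<String>) {
    let mut rows = Vec::new();
    let mut keys = HashSet::new();
    for line in orders
        .lines()
        .filter(|l| !l.trim().is_empty())
        .take(max_orders)
    {
        if let Some(key) = line.split('|').next() {
            keys.insert(key.to_string());
            rows.push(line.to_string());
        }
    }
    (rows, keys)
}

/// Keep only the lineitem-style rows whose order key survived the sampling,
/// so that joins between the two reduced files still match.
/// Assumption: the order key is the first pipe-delimited field.
fn filter_lineitems(lineitems: &str, kept_orders: &HashSet<String>) -> Vec<String> {
    lineitems
        .lines()
        .filter(|line| {
            line.split('|')
                .next()
                .map_or(false, |key| kept_orders.contains(key))
        })
        .map(str::to_string)
        .collect()
}
```

Taking the first N rows is the simplest possible sampling; a real script would probably need to pick rows so that every one of the 22 queries still returns a non-empty result, and would have to apply the same key-based filtering to the other tables.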
