
Conversation

Collaborator

@NGA-TRAN NGA-TRAN commented Jul 24, 2025

This PR:

  • Enables us to create 3 different types of test clusters:
    1. Two-worker cluster that understands the small TPC-H files stored in ./tpch/data/
    2. Two-worker cluster with mock data of 2 tables (and we can add as many more as we want)
    3. Custom number of workers and custom data files
  • Adds some tests:
    1. Tests for cluster setups
    2. Tests for all TPC-H queries, run on a small data set (20 rows per file)
    3. Basic tests on mock data

Best way to review this PR

  1. Start with cluster_setup.rs, a util that creates clusters for us. Main jobs of this file:
    • Help create any of the 3 cluster choices for different testing purposes in one line
    • Help execute a given query in one line
    • Its tests cover cluster setup and operation
  2. Then tpch_small.rs, which tests all 22 TPC-H queries on a small set of data
  3. Then mock_data.rs, which creates 2 tables and adds some data
  4. Then basic_Tests.rs, which shows some examples of how to test the mock data
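The intended usage pattern can be sketched as below. The type and method names here are illustrative stand-ins, not the actual cluster_setup.rs API:

```rust
// Illustrative stubs only -- the real types live in cluster_setup.rs.
struct TestCluster {
    workers: usize,
}

impl TestCluster {
    /// One-line constructor for the two-worker cluster over ./tpch/data/.
    fn tpch_small() -> Self {
        TestCluster { workers: 2 }
    }

    /// One-line constructor with a custom worker count (data files omitted here).
    fn custom(workers: usize) -> Self {
        TestCluster { workers }
    }

    /// One-line query execution returning a formatted result string.
    fn execute(&self, sql: &str) -> String {
        format!("ran `{sql}` on {} workers", self.workers)
    }
}

fn main() {
    let cluster = TestCluster::tpch_small();
    println!("{}", cluster.execute("SELECT count(*) FROM lineitem"));
    assert_eq!(TestCluster::custom(4).workers, 4);
}
```

The point of the design is that a test file needs exactly two lines: one to get a cluster, one to run a query.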

Next step

TODO: Will link a ticket for useful tests to add shortly.

NGA-TRAN and others added 30 commits June 20, 2025 10:57

Move code from internal mp-rs to open source. All datadog dependencies have been removed and new built, scripts and readme are added
Rename all datafusion-distributed to distributed-datafusion and remov…
…-version

Increase minimum Rust version to 1.82
Add distributed plan and stages to EXPLAIN output
}

// Give processes time to die
thread::sleep(Duration::from_secs(1));
Collaborator

A sub issue to improve this to poll the workers until they die would be good 👍🏽

Collaborator

Or, it can be done in this PR, similarly to is_cluster_ready

Collaborator Author

There are many needed sleeps, so I agree we should have one sub-issue to make them better
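The suggested polling can be built on `std::process::Child::try_wait`, which reports whether a child has exited without blocking. A minimal sketch (not wired into the actual test harness, and with an assumed helper name):

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::{Duration, Instant};

/// Poll a child process until it exits, instead of a fixed sleep.
/// Returns true if the process exited within `timeout`.
fn wait_until_dead(child: &mut Child, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        match child.try_wait() {
            Ok(Some(_status)) => return true,                   // process has exited
            Ok(None) => thread::sleep(Duration::from_millis(50)), // still running, poll again
            Err(_) => return false,                              // could not query the child
        }
    }
    false // timed out while the process was still alive
}

fn main() {
    // Spawn a short-lived process and confirm we observe its exit promptly.
    let mut child = Command::new("true").spawn().expect("spawn");
    assert!(wait_until_dead(&mut child, Duration::from_secs(5)));
    println!("ok");
}
```

Compared to a flat one-second sleep, this returns as soon as the worker is gone and only pays the full delay in the failure case.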

cmd.args(["--mode", "proxy", "--port", &proxy_port.to_string()])
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.env("RUST_LOG", "info,distributed_datafusion=debug")
Collaborator

These environment vars are a bit confusing. How do they get set?

Collaborator

It's used here upon startup

Collaborator

@LiaCastaneda LiaCastaneda Jul 28, 2025

But I agree we should document it somewhere

Collaborator Author

I agree we should add RUST_LOG to the documentation. We need a PR to write and update our docs

…g)> with Vec<TableConfig> for improved clarity
@NGA-TRAN NGA-TRAN marked this pull request as draft July 28, 2025 20:10

Ok(())
}

Collaborator Author

These changes are necessary to support view registration in the same way we currently register tables. They'll remain in place until we establish a proper catalog implementation.

@@ -1,1222 +1,3 @@
use std::fs;
Collaborator Author

The red (removed) lines here were moved to tpch_validation_helpers.rs to keep this file short and reserved for importing modules

@@ -0,0 +1,1222 @@
use std::fs;
Collaborator Author

The content of this file is from mod.rs. No need to review it. Also, since we now have a way to run queries without going through Python, I will soon remove this file after I simplify the TPC-H validation tests

@NGA-TRAN NGA-TRAN marked this pull request as ready for review July 30, 2025 02:32
@NGA-TRAN NGA-TRAN requested review from LiaCastaneda, fmonjalet and jayshrivastava and removed request for LiaCastaneda and jayshrivastava July 30, 2025 03:04
");
}

// TODO: Investigate why only at most 4 or 5 queries can be executed.
Collaborator Author

I will create a ticket for this

Collaborator

@fmonjalet fmonjalet left a comment

No blocker on my end since this is a net improvement of the testing situation. I left a few comments that are now on outdated diffs but are still relevant.

Eventually I think we should avoid spawning processes and listening to network ports for these tests, and rather spawn tokio tasks for the proxy and worker logic, with in memory channels. The reasons for this are mostly:

  • The ability to instrument for race reproducibility and do fault injection. This is something we had to do a lot on a previous distributed query engine.
  • The ability to debug and put breakpoints easily (maybe VSCode or RustRover are handling this well though).
  • Not having to get process and network management right (which can be hard in corner cases).
  • It's impossible to do when the code relies on global / implicit state, so the one-process approach forces a nice property: state has to be properly contained. This is a very important property if we want people to be able to embed this logic in various projects.

But getting a first framework that works well is a great step forward, and the APIs are well built so that the above changes should not impact the tests too much. Thanks for the great work!

static MOCK_DATA_INIT: Once = Once::new();

/// Global cluster instance shared across basic tests
static BASIC_CLUSTER: OnceCell<TestCluster> = OnceCell::const_new();
Collaborator

Since these are sub processes, how is this cluster cleaned up in case of panic or harder (sigsegv, sigabrt) fault?

Collaborator Author

This is a very good point. Let me dig a bit more into how we handle this
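For the panic case specifically, Rust unwinds and runs `Drop`, so one common pattern is an RAII guard that kills the child processes when it goes out of scope. Hard faults (SIGKILL, SIGSEGV) of the test process skip `Drop` and would still leak children. A sketch under that assumption, not the actual TestCluster implementation:

```rust
use std::process::{Child, Command};

/// RAII guard: kills spawned worker processes when dropped, so a panicking
/// test still tears its cluster down. Hard faults of the test process itself
/// bypass Drop, so those would need an external reaper.
struct ProcessGuard {
    children: Vec<Child>,
}

impl Drop for ProcessGuard {
    fn drop(&mut self) {
        for child in &mut self.children {
            let _ = child.kill(); // ignore "already exited" errors
            let _ = child.wait(); // reap to avoid zombie processes
        }
    }
}

fn main() {
    // Stand-in for a worker process.
    let child = Command::new("sleep").arg("30").spawn().expect("spawn");
    {
        let _guard = ProcessGuard { children: vec![child] };
        // _guard dropped at the end of this scope -> child is killed and reaped
    }
    println!("cleaned up");
}
```

Holding such a guard inside the `OnceCell` value would give the shared cluster the same cleanup behavior.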


// Configure cluster with mock data tables
let config = ClusterConfig::new()
.with_base_port(33000) // Use different port range from TPC-H tests
Collaborator

Ideally in the future we have in-process workers and no network to avoid this kind of hassle (using in-memory channels)

Comment on lines 75 to 76
let result = execute_basic_query("SELECT * FROM customers ORDER BY customer_id").await;
assert_snapshot!(result, @r"
Collaborator

This is a really nice and simple API to write tests!

minor (since it's easy to address at any point in the future): Looking at it, I am even wondering if we should not adhere to the sqllogictest format instead, so that we can reuse existing test beds.

Collaborator Author

Do you mean we use sqllogictest instead of this insta assert_snapshot? I like sqllogictest and I think it is a good idea

Collaborator

Yes exactly this. They have their own quirks so we'll also need the type of test you just wrote in some cases, but there is a huge number of existing tests that we could easily port, which is useful.
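For reference, a sqllogictest-style case for this data might look like the sketch below. The `IT` after `query` declares an integer and a text result column; the table name and row are taken from the snippet in this PR, and exact whitespace conventions vary between runners:

```
query IT
SELECT customer_id, name FROM customers WHERE customer_id = 5
----
5 Eve Brown
```

Expected results follow the `----` separator, which is what makes large existing suites easy to port.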

"5,Eve Brown,Sydney,Australia",
];

fs::write(path, customers_data.join("\n"))?;
Collaborator

nit: do we need to write this to the FS vs registering this as a MemTable directly?

pub fn allocate_port_range() -> u16 {
    let mut current_port = CURRENT_MAX_PORT.lock().unwrap();
    let base_port = *current_port;
    *current_port += MAX_CLUSTER_SIZE;
    base_port
}
Collaborator

This is fine as long as we don't create too many clusters and tests. Do we have tests that run concurrently and compete for ports?

Collaborator Author

There is a big range of available ports for us to use, and the goal is for all tests in the same file/module to share one cluster, so the range won't grow too much. But I agree this can become an issue when we add a lot more tests.

I will open a ticket for us to improve this. In general, the test infrastructure for a distributed system is far more complicated than for a single node and will need a lot of work before we feel comfortable
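One common alternative to a global port counter is to let the OS pick free ports by binding to port 0, which avoids cross-test coordination entirely. A sketch, not wired into the harness; note there is a small race window between dropping the listener and the worker re-binding the port:

```rust
use std::net::TcpListener;

/// Ask the OS for a currently free TCP port by binding to port 0 and
/// reading back the assigned address. The port is released when the
/// listener is dropped, so a worker can bind it immediately afterwards
/// (with a small inherent race against other processes).
fn free_port() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind ephemeral port");
    listener.local_addr().expect("local addr").port()
}

fn main() {
    let a = free_port();
    let b = free_port();
    println!("allocated ports {a} and {b}");
    assert!(a > 0 && b > 0);
}
```

This scales with any number of concurrent test binaries, since the kernel, not the test code, arbitrates port ownership.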


8 participants