Commit 2099424

refactor the tpch validation tests and move them to integration tests
1 parent f526aff commit 2099424

8 files changed (+1405 -2235 lines changed)

.gitignore

Lines changed: 26 additions & 1 deletion
@@ -1,2 +1,27 @@
+# Rust build artifacts
 target/
-.idea/
+
+# IDE and editor files
+.idea/
+
+# Python virtual environments and files
+.venv/
+.python_startup.py
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+
+# Log files
+*.log
+proxy.log
+worker*.log
+
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db

README.md

Lines changed: 20 additions & 79 deletions
@@ -138,7 +138,9 @@ cargo build --release
 
 ### Running Tests
 
-Run all tests:
+#### Basic Tests
+
+Run all unit tests (fast - excludes TPC-H validation):
 
 ```bash
 cargo test
@@ -150,6 +152,20 @@ Run tests with output:
 cargo test -- --nocapture
 ```
 
+#### TPC-H Validation Integration Tests
+
+Run comprehensive TPC-H validation tests that compare distributed DataFusion against regular DataFusion. No prerequisites needed - the tests handle everything automatically!
+
+```bash
+# Run all TPC-H validation tests (manual - excluded from cargo test for speed)
+cargo test --test tpch_validation test_tpch_validation_all_queries -- --ignored --nocapture
+
+# Run single query test for debugging
+cargo test --test tpch_validation test_tpch_validation_single_query -- --ignored --nocapture
+```
+
+**Note:** TPC-H validation tests are marked with `#[ignore]` to keep `cargo test` fast for development. Run them manually when needed for validation.
+
 ## Usage
 
 With the code now built and ready, the next step is to set up the server and execute queries. To do that, we'll need a schema and some data—so for this example, we'll use the TPC-H schema and queries.
@@ -323,84 +339,6 @@ The system supports various configuration options through environment variables:
 - `DFRAY_TABLES`: Comma-separated list of tables in format `name:format:path`
 - `DFRAY_VIEWS`: Semicolon-separated list of CREATE VIEW SQL statements
 
-## TPC-H Query Validation
-
-To validate that your distributed cluster is working correctly, you can use the automated validation script that compares results between DataFusion CLI (single-node) and the distributed system:
-
-```bash
-# Run validation with default settings (2 workers, /tmp/tpch_s1 data)
-./scripts/validate_tpch_correctness.sh
-
-# Run validation with custom settings
-./scripts/validate_tpch_correctness.sh num_workers=3 tpch_file_path=/path/to/tpch/data log_file_path=./logs query_path=./tpch/queries/
-```
-
-**Key Features:**
-- **Automated Setup**: Installs `datafusion-cli` and `tpchgen-cli` if missing
-- **Data Generation**: Creates TPC-H data automatically if not found
-- **Smart Validation**: Compares all 22 TPC-H queries with floating-point tolerance
-- **Cluster Detection**: Uses existing cluster or launches a new one
-- **Detailed Reporting**: Generates comprehensive validation reports
-
-**Example Output:**
-```
-==============================================================================
-TPC-H Correctness Validation
-==============================================================================
-Configuration:
-- Workers: 2
-- TPC-H Data Directory: /tmp/tpch_s1
-- Query Path: ./tpch/queries/
-- Proxy Port: 20200
-
-[SUCCESS] q1: Results match ✓ (within floating-point tolerance)
-[SUCCESS] q6: Results match ✓ (within floating-point tolerance)
-...
-
-==============================================================================
-Validation Summary
-==============================================================================
-Total queries tested: 22
-Passed: 20
-Failed: 2
-Success rate: 90%
-
-Detailed report: ./logs/validation_results/validation_report.txt
-Result files: ./logs/validation_results
-```
-
-The script will warn you if the running cluster has a different number of workers than requested, and automatically handles missing dependencies and data generation.
-
-<!-- TODO: Merge this section into the above -->
-## TPC-H Validation Tests
-
-The project includes comprehensive TPC-H validation tests that automatically compare results between regular DataFusion and distributed DataFusion to ensure correctness. These tests are completely self-contained and handle all setup automatically:
-
-```bash
-# Run all TPC-H validation tests (fully automated)
-cargo test --lib tpch_validation_tests -- --nocapture
-
-# Run single query test for debugging
-cargo test --lib test_tpch_validation_single_query -- --ignored --nocapture
-```
-
-**What the tests do automatically:**
-- ✅ Kill existing processes on ports 40400-40402
-- ✅ Install `tpchgen-cli` if not available
-- ✅ Generate TPC-H data at `/tmp/tpch_s1` if not present
-- ✅ Start distributed cluster (1 proxy + 2 workers)
-- ✅ Run all 22 TPC-H queries on both systems
-- ✅ Compare results with floating-point tolerance
-- ✅ Clean up cluster processes
-
-**Architecture:**
-- **Proxy**: Port 40400
-- **Worker 1**: Port 40401
-- **Worker 2**: Port 40402
-- **TPC-H Data**: `/tmp/tpch_s1` (scale factor 1)
-
-No prerequisites needed - just run `cargo test --lib tpch_validation_tests -- --nocapture` and everything is handled automatically!
-
 ## Development
 
 ### Project Structure
@@ -417,6 +355,9 @@ No prerequisites needed - just run `cargo test --lib tpch_validation_tests -- --
 - `launch_python_arrowflightsql_client.sh`: Launch Python query client
 - `build_and_push_docker.sh`: Docker build and push script
 - `python_tests.sh`: Python test runner
+- `tests/`: Integration tests
+- `tpch_validation.rs`: TPC-H validation integration tests
+- `common/mod.rs`: Shared test utilities and helper functions
 - `tpch/queries/`: TPC-H benchmark SQL queries
 - `testdata/`: Test data files
 - `k8s/`: Kubernetes deployment files
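
The README text in this commit says the TPC-H validation tests are marked `#[ignore]`, and the removed sections describe comparing results "with floating-point tolerance". The contents of `tests/tpch_validation.rs` and `tests/common/mod.rs` are not shown in this diff, so the following is only a minimal sketch of that pattern. The names `approx_eq`, `rows_match`, and `test_results_match_within_tolerance`, the choice of a relative tolerance, and the sample data are all assumptions for illustration, not the repository's code.

```rust
// Hypothetical helpers: NOT taken from tests/common/mod.rs in this commit.

/// Returns true when two floats agree within a relative tolerance.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    if a == b {
        return true; // exact match, including 0.0 == 0.0
    }
    // Scale the tolerance by the larger magnitude of the two values.
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}

/// Compares two numeric result sets row by row and cell by cell.
/// Row order is assumed to be identical on both sides.
fn rows_match(expected: &[Vec<f64>], actual: &[Vec<f64>], rel_tol: f64) -> bool {
    expected.len() == actual.len()
        && expected.iter().zip(actual).all(|(e, a)| {
            e.len() == a.len()
                && e.iter().zip(a).all(|(x, y)| approx_eq(*x, *y, rel_tol))
        })
}

#[test]
#[ignore] // skipped by a plain `cargo test`; runs only with `-- --ignored`
fn test_results_match_within_tolerance() {
    // Stand-in data; a real test would collect rows from both engines.
    let single_node = vec![vec![1.0, 100.25], vec![2.0, 0.1 + 0.2]];
    let distributed = vec![vec![1.0, 100.25], vec![2.0, 0.3]];
    assert!(rows_match(&single_node, &distributed, 1e-9));
}
```

Placed in a file under `tests/`, a test like this would be skipped by `cargo test` and run only when `-- --ignored` is passed, which matches the invocation style the new README documents.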
