
Commit 0af9ff5

Improve sqllogictest speed by creating only a single large file rather than 2 (#20586)
Draft as it builds on #20576

## Which issue does this PR close?

- Part of #20524
- Follow on to #20576 from @alamb

## Rationale for this change

Execution time of the test is dominated by the time spent writing the parquet files. By reusing one file we gain around a 30% improvement in execution time here.

## What changes are included in this PR?

Building on #20576, we reuse the needed parquet file for the test instead of recreating it.

## Are these changes tested?

Ran the test with the following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two **physically separate** files? The race condition in #17197 was in the execution layer: both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there is any concern, we could use `system cp` to copy the file and register two physical files while still paying the `generate_series` cost only once.

## Are there any user-facing changes?
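For reference, the `system cp` fallback mentioned above might look roughly like this in the .slt file. This is a hypothetical sketch, not part of this PR: the exact `system` directive syntax depends on the sqllogictest runner in use, and the generation query is the one already present in the test.

```
# Generate the data once, paying the generate_series cost a single time
query I
COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;
----
10000000

# Duplicate the file at the OS level instead of regenerating it
system ok
cp test_files/scratch/push_down_filter_regression/t2.parquet test_files/scratch/push_down_filter_regression/t1.parquet

statement ok
create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';

statement ok
create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
```

This variant would preserve two physically separate files, and so would be the safer choice if the original race in #17197 turned out to depend on distinct on-disk inputs.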
1 parent 93d177d commit 0af9ff5

File tree

1 file changed: +2 -9 lines changed


datafusion/sqllogictest/test_files/push_down_filter_regression.slt

Lines changed: 2 additions & 9 deletions
```diff
@@ -18,13 +18,6 @@
 # Test push down filter
 
 # Regression test for https://github.com/apache/datafusion/issues/17188
-query I
-COPY (select i as k from generate_series(1, 10000000) as t(i))
-TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
-STORED AS PARQUET;
-----
-10000000
-
 query I
 COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
 TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
@@ -33,10 +26,10 @@ STORED AS PARQUET;
 10000000
 
 statement ok
-create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';
+create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 statement ok
-create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
+create external table t1 (k bigint not null) stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 # The failure before https://github.com/apache/datafusion/pull/17197 was non-deterministic and random
 # So we'll run the same query a couple of times just to have more certainty it's fixed
```
