
Commit 0af9ff5

Improve sqllogictest speed by creating only a single large file rather than 2 (#20586)
Draft as it builds on #20576

## Which issue does this PR close?

- Part of #20524
- Follow on to #20576 from @alamb

## Rationale for this change

Execution time of the test is dominated by the time spent writing the parquet files. By reusing one file we gain around a 30% improvement in execution time here.

## What changes are included in this PR?

Building on #20576, we reuse the needed parquet file for the test instead of recreating it.

## Are these changes tested?

Ran the test with the following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two **physically separate** files? The race condition in #17197 was in the execution layer: both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there is any concern, we could use `system cp` to copy the file and register two physical files while still paying the `generate_series` cost only once.

## Are there any user-facing changes?
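For reference, the `system cp` fallback mentioned above might look roughly like this in the .slt file. This is a hypothetical sketch, not part of this PR: the exact `system` directive syntax depends on the sqllogictest runner in use, and the generation query is the one already present in the test.

```
# Generate the data once, paying the generate_series cost a single time
query I
COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;
----
10000000

# Duplicate the file at the OS level instead of regenerating it
system ok
cp test_files/scratch/push_down_filter_regression/t2.parquet test_files/scratch/push_down_filter_regression/t1.parquet

statement ok
create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';

statement ok
create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
```

This variant would preserve two physically separate files, and so would be the safer choice if the original race in #17197 turned out to depend on distinct on-disk inputs.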
1 parent 93d177d commit 0af9ff5

File tree

1 file changed: +2 -9 lines changed


datafusion/sqllogictest/test_files/push_down_filter_regression.slt

Lines changed: 2 additions & 9 deletions
```diff
@@ -18,13 +18,6 @@
 # Test push down filter
 
 # Regression test for https://github.com/apache/datafusion/issues/17188
-query I
-COPY (select i as k from generate_series(1, 10000000) as t(i))
-TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
-STORED AS PARQUET;
-----
-10000000
-
 query I
 COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
 TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
@@ -33,10 +26,10 @@ STORED AS PARQUET;
 10000000
 
 statement ok
-create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';
+create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 statement ok
-create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
+create external table t1 (k bigint not null) stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 # The failure before https://github.com/apache/datafusion/pull/17197 was non-deterministic and random
 # So we'll run the same query a couple of times just to have more certainty it's fixed
```
