Commit 0af9ff5
authored
Improve sqllogicteset speed by creating only a single large file rather than 2 (#20586)
Draft as it builds on #20576
## Which issue does this PR close?
- Part of #20524
- Follow on to #20576 from
@alamb
## Rationale for this change
Execution time of the test is dominated by the time writing the parquet
files. By reusing the file we can gain around 30% improvement on the
execution time here.
## What changes are included in this PR?
Building on #20576 we reuse the needed parquet file for the test instead
of recreating it.
## Are these changes tested?
Ran the test with following results:
| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |
One open question: does the correctness of this regression test rely on
having two **physically separate** files? The race condition in #17197
was in the execution layer — both scans would still be independent
`DataSourceExec` nodes with independent readers, so I believe the
behavior is preserved. But if there's any concern, we could use `system
cp` to copy the file and register two physical files while still only
paying the `generate_series` cost once.
## Are there any user-facing changes?1 parent 93d177d commit 0af9ff5
File tree
1 file changed
+2
-9
lines changed- datafusion/sqllogictest/test_files
1 file changed
+2
-9
lines changedLines changed: 2 additions & 9 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
21 | | - | |
22 | | - | |
23 | | - | |
24 | | - | |
25 | | - | |
26 | | - | |
27 | | - | |
28 | 21 | | |
29 | 22 | | |
30 | 23 | | |
| |||
33 | 26 | | |
34 | 27 | | |
35 | 28 | | |
36 | | - | |
| 29 | + | |
37 | 30 | | |
38 | 31 | | |
39 | | - | |
| 32 | + | |
40 | 33 | | |
41 | 34 | | |
42 | 35 | | |
| |||
0 commit comments