Commit c9b1819
[opt](multi-catalog) Optimize file split size. (#58858)
### What problem does this PR solve?
### Release note
This PR introduces a **dynamic and progressive file split size
adjustment mechanism** to improve scan parallelism and resource
utilization for external table scans, while avoiding excessive small
splits or inefficiently large initial splits.
#### 1. Split Size Adjustment Strategy
##### 1.1 Non-Batch Split Mode
In non-batch split mode, a **two-phase split size selection strategy**
is applied based on the total size of all input files:
* The total size of all splits is calculated in advance.
* If the total size **exceeds `maxInitialSplitNum * maxInitialSplitSize`**:
  * `split_size = maxSplitSize` (default **64MB**)
* Otherwise:
  * `split_size = maxInitialSplitSize` (default **32MB**)
This strategy keeps splits small for small datasets, which preserves
parallelism, while limiting the total number of splits for large-scale
scans.
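
The selection rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: the class and method names are hypothetical, and the value `200` used for `maxInitialSplitNum` in `main` is an assumed example, since the description does not state its default.

```java
// Hypothetical sketch of the two-phase split size selection (non-batch mode).
final class NonBatchSplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Returns maxSplitSize when the total input size exceeds
    // maxInitialSplitNum * maxInitialSplitSize, otherwise maxInitialSplitSize.
    static long chooseSplitSize(long totalFileSize, long maxInitialSplitNum,
                                long maxInitialSplitSize, long maxSplitSize) {
        long threshold = maxInitialSplitNum * maxInitialSplitSize;
        return totalFileSize > threshold ? maxSplitSize : maxInitialSplitSize;
    }

    public static void main(String[] args) {
        long maxInitialSplitNum = 200; // assumed value, for illustration only
        long small = chooseSplitSize(1024 * MB, maxInitialSplitNum, 32 * MB, 64 * MB);
        long large = chooseSplitSize(10240 * MB, maxInitialSplitNum, 32 * MB, 64 * MB);
        System.out.println("1 GB scan  -> split size in MB: " + small / MB);
        System.out.println("10 GB scan -> split size in MB: " + large / MB);
    }
}
```

With these assumed values the threshold is 6400MB, so a 1GB scan keeps 32MB splits and a 10GB scan switches to 64MB splits.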
---
##### 1.2 Batch Split Mode
In batch split mode, a **progressive split size adjustment strategy** is
introduced:
* As the total file size increases and the number of files gradually
  **exceeds `maxInitialSplitNum`**, the `split_size` is **smoothly
  increased from `maxInitialSplitSize` (32MB) toward `maxSplitSize`
  (64MB)**.
This approach avoids generating too many small splits at the early stage
while gradually increasing scan parallelism as the workload grows,
resulting in more stable scheduling and execution behavior.
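
The description does not give the exact ramp formula, so the following is only one possible reading of "smoothly increased". It assumes a linear ramp that starts once the file count passes `maxInitialSplitNum` and reaches `maxSplitSize` at twice that count. The ramp shape, its end point, and all names are assumptions made for illustration.

```java
// Hypothetical sketch of a progressive split size for batch split mode.
// ASSUMPTION: linear interpolation between maxInitialSplitNum and
// 2 * maxInitialSplitNum files; the real formula may differ.
final class ProgressiveSplitSizeSketch {
    static final long MB = 1024L * 1024L;

    static long progressiveSplitSize(long filesSeen, long maxInitialSplitNum,
                                     long maxInitialSplitSize, long maxSplitSize) {
        if (filesSeen <= maxInitialSplitNum) {
            return maxInitialSplitSize;           // early stage: keep the initial size
        }
        long rampEnd = 2 * maxInitialSplitNum;    // assumed end of the ramp
        if (filesSeen >= rampEnd) {
            return maxSplitSize;                  // fully ramped up
        }
        long span = maxSplitSize - maxInitialSplitSize;
        long progress = filesSeen - maxInitialSplitNum;
        return maxInitialSplitSize + span * progress / maxInitialSplitNum;
    }

    public static void main(String[] args) {
        for (long files : new long[] {100, 250, 300, 400, 1000}) {
            long size = progressiveSplitSize(files, 200, 32 * MB, 64 * MB);
            System.out.println(files + " files -> split size in MB: " + size / MB);
        }
    }
}
```

Whatever the actual curve is, the property the description relies on is that the split size never drops below `maxInitialSplitSize` and never exceeds `maxSplitSize`.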
---
##### 1.3 User-Specified Split Size (Backward Compatibility)
This PR **preserves the session variable `file_split_size`** for
user-defined split size configuration:
* If `file_split_size` is explicitly set by the user:
  * The user-defined value takes precedence.
  * The dynamic split size adjustment logic is bypassed.
* This ensures full backward compatibility with existing configurations
  and tuning practices.
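
The precedence rule can be sketched as below. The sketch assumes that a value of `0` means `file_split_size` was not set. That convention, along with the class and method names, is an assumption for illustration and is not taken from the PR.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of the precedence between the user-defined
// file_split_size and the dynamically computed split size.
final class EffectiveSplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // ASSUMPTION: userFileSplitSize <= 0 means "not set by the user".
    // The dynamic logic is passed as a supplier so that it is not even
    // evaluated when the user value takes precedence.
    static long effectiveSplitSize(long userFileSplitSize, LongSupplier dynamicSplitSize) {
        if (userFileSplitSize > 0) {
            return userFileSplitSize;
        }
        return dynamicSplitSize.getAsLong();
    }

    public static void main(String[] args) {
        long withUserValue = effectiveSplitSize(128 * MB, () -> 32 * MB);
        long withoutUserValue = effectiveSplitSize(0, () -> 32 * MB);
        System.out.println("user value set -> MB: " + withUserValue / MB);
        System.out.println("user value unset -> MB: " + withoutUserValue / MB);
    }
}
```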
---
#### 2. Support Status by Data Source
| Data Source | Non-Batch Split Mode | Batch Split Mode | Notes |
| ----------- | -------------------- | ---------------- | ----- |
| Hive | ✅ Supported | ✅ Supported | Uses Doris internal HDFS FileSplitter |
| Iceberg | ✅ Supported | ❌ Not supported | File splitting is currently delegated to Iceberg APIs |
| Paimon | ✅ Supported | ❌ Not supported | Only non-batch split mode is implemented |
---
#### 3. New Hive HDFS FileSplitter Logic
For Hive HDFS files, this PR introduces an enhanced file splitting
strategy:
1. **Splits never span multiple HDFS blocks**
   * Prevents cross-block reads and avoids unnecessary IO overhead.
2. **Tail split optimization**
   * If the remaining file size is smaller than `split_size * 2`,
     the remaining part is **evenly divided** into splits,
     preventing the creation of very small tail splits and improving
     overall scan efficiency.
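
The two rules can be combined in a small sketch. It is not the FileSplitter from this PR. It assumes fixed-size blocks, applies the tail rule to the bytes remaining inside each block, divides a tail larger than `split_size` into two halves, and emits a tail no larger than `split_size` as a single split. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: block-aligned splitting with an evenly divided tail.
final class BlockAlignedSplitterSketch {
    static final long MB = 1024L * 1024L;

    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    static List<Split> split(long fileLength, long blockSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long blockStart = 0; blockStart < fileLength; blockStart += blockSize) {
            long blockEnd = Math.min(blockStart + blockSize, fileLength);
            long offset = blockStart;
            while (offset < blockEnd) {
                long remaining = blockEnd - offset; // never reaches into the next block
                long length;
                if (remaining >= 2 * splitSize) {
                    length = splitSize;             // regular full-size split
                } else if (remaining > splitSize) {
                    length = (remaining + 1) / 2;   // tail: first of two even halves
                } else {
                    length = remaining;             // tail fits in a single split
                }
                splits.add(new Split(offset, length));
                offset += length;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 100MB file in one 128MB block with a 64MB split size.
        for (Split s : split(100 * MB, 128 * MB, 64 * MB)) {
            System.out.println("start(MB)=" + s.start / MB + " length(MB)=" + s.length / MB);
        }
    }
}
```

In the 100MB example, a naive splitter would produce a 64MB split followed by a 36MB one, whereas this sketch yields two 50MB splits.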
---
#### Summary
* Introduces dynamic and progressive split size adjustment
* Supports both batch and non-batch split modes
* Preserves user-defined split size configuration for backward
compatibility
* Optimizes Hive HDFS file splitting to reduce small tail splits and
cross-block IO

1 parent a3252b7 commit c9b1819
14 files changed (+712, -96):

- fe/fe-core/src
  - main/java/org/apache/doris
    - datasource
      - hive/source
      - iceberg/source
      - paimon/source
      - tvf/source
    - qe
  - test/java/org/apache/doris
    - datasource
      - paimon/source
    - planner
- regression-test/suites/external_table_p0/hive