Skip to content

Commit 355c7cb

Browse files
committed
[SPARK-54840][SQL] OrcList Pre-allocation
### What changes were proposed in this pull request? This PR optimizes ORC serialization performance by pre-allocating `OrcList` with the exact array size instead of relying on dynamic resizing. **Key changes:** - Pre-allocate `OrcList` with `numElements` from the input array - Avoid multiple ArrayList resize operations and element copying during serialization - Cache `numElements` value to avoid redundant calls in the loop condition ### Why are the changes needed? **Problem:** When serializing arrays to ORC format, the current implementation creates an empty `OrcList` (which extends `ArrayList`) and grows it dynamically. For large arrays, this triggers multiple resize operations, each requiring: 1. Allocating a new larger backing array 2. Copying all existing elements to the new array 3. Discarding the old array **Performance Impact:** For an array with 65,536 elements, the default ArrayList growth pattern (1.5x capacity increase) causes ~16 resize operations, copying approximately 1 million elements in total. **Solution:** By pre-allocating the `OrcList` with the known size, we eliminate all resize operations and associated element copying, resulting in: ``` === Resize Behavior Analysis: 1000000 elements === Target size: 1,000,000 elements Total resizes: 29 Total elements copied: 2,430,972 Final capacity: 1,215,487 Resize sequence: 1. Resize: 10 → 15 (copy 10 elements) 2. Resize: 15 → 22 (copy 15 elements) 3. Resize: 22 → 33 (copy 22 elements) 4. Resize: 33 → 49 (copy 33 elements) 5. Resize: 49 → 73 (copy 49 elements) 6. Resize: 73 → 109 (copy 73 elements) 7. Resize: 109 → 163 (copy 109 elements) 8. Resize: 163 → 244 (copy 163 elements) 9. Resize: 244 → 366 (copy 244 elements) 10. Resize: 366 → 549 (copy 366 elements) 11. Resize: 549 → 823 (copy 549 elements) 12. Resize: 823 → 1,234 (copy 823 elements) 13. Resize: 1,234 → 1,851 (copy 1,234 elements) 14. Resize: 1,851 → 2,776 (copy 1,851 elements) 15. Resize: 2,776 → 4,164 (copy 2,776 elements) 16. Resize: 4,164 → 6,246 (copy 4,164 elements) 17. Resize: 6,246 → 9,369 (copy 6,246 elements) 18. Resize: 9,369 → 14,053 (copy 9,369 elements) 19. Resize: 14,053 → 21,079 (copy 14,053 elements) 20. Resize: 21,079 → 31,618 (copy 21,079 elements) 21. Resize: 31,618 → 47,427 (copy 31,618 elements) 22. Resize: 47,427 → 71,140 (copy 47,427 elements) 23. Resize: 71,140 → 106,710 (copy 71,140 elements) 24. Resize: 106,710 → 160,065 (copy 106,710 elements) 25. Resize: 160,065 → 240,097 (copy 160,065 elements) 26. Resize: 240,097 → 360,145 (copy 240,097 elements) 27. Resize: 360,145 → 540,217 (copy 360,145 elements) 28. Resize: 540,217 → 810,325 (copy 540,217 elements) 29. Resize: 810,325 → 1,215,487 (copy 810,325 elements) Overhead ratio: 2.43x (For every element added, 2.43 elements are copied during resizes) === Testing Array Size: 1000000 === Array Size: 1,000,000 elements Iterations: 50 Without pre-allocation: 1328.69 ms total (26.574 ms/iter) With pre-allocation: 620.36 ms total (12.407 ms/iter) Time saved: 708.33 ms (53.3% improvement) Expected resize count: 29 times Final capacity needed: 1215487 ``` ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization with no functional changes. The output remains identical. ### How was this patch tested? 1. **Existing Tests:** All existing ORC-related tests pass, ensuring correctness is maintained 2. **Performance Testing:** see codeblock above ### Was this patch authored or co-authored using generative AI tooling? GitHub Copilot w/ Claude Sonnet 4.5 Closes #53605 from yaooqinn/SPARK-54840. Authored-by: Kent Yao <[email protected]> Signed-off-by: Kent Yao <[email protected]>
1 parent f63ccce commit 355c7cb

File tree

1 file changed

+14
-11
lines changed

1 file changed

+14
-11
lines changed

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala

Lines changed: 14 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -178,19 +178,22 @@ class OrcSerializer(dataSchema: StructType) {
178178
result
179179

180180
case ArrayType(elementType, _) => (getter, ordinal) =>
181-
val result = OrcStruct.createValue(orcType)
182-
.asInstanceOf[OrcList[WritableComparable[_]]]
183-
// Need to put all converted values to a list, can't reuse object.
184-
val elementConverter = newConverter(elementType, orcType.getChildren.get(0), reuseObj = false)
185181
val array = getter.getArray(ordinal)
186-
var i = 0
187-
while (i < array.numElements()) {
188-
if (array.isNullAt(i)) {
189-
result.add(null)
190-
} else {
191-
result.add(elementConverter(array, i))
182+
val numElements = array.numElements()
183+
val result = new OrcList[WritableComparable[_]](orcType, numElements)
184+
if (numElements > 0) {
185+
// Need to put all converted values to a list, can't reuse object.
186+
val elementConverter =
187+
newConverter(elementType, orcType.getChildren.get(0), reuseObj = false)
188+
var i = 0
189+
while (i < numElements) {
190+
if (array.isNullAt(i)) {
191+
result.add(null)
192+
} else {
193+
result.add(elementConverter(array, i))
194+
}
195+
i += 1
192196
}
193-
i += 1
194197
}
195198
result
196199

0 commit comments

Comments
 (0)