[SPARK-53797] Improve performance of maxBytesPerTrigger for file stream sources #52515
What changes were proposed in this pull request?
Fixes a performance issue with the new `maxBytesPerTrigger` option for file stream sources, introduced in #44636. When `maxBytesPerTrigger` is used with a file stream source, the files considered while calculating offsets are now traversed by iteration rather than by list indexing.

Why are the changes needed?
We tried out this new option and found that streams reading tables with many files (in the millions) were spending hours constructing batches. The thread dump showed that the `files` object was stored as a Scala `immutable.List`, which is a linked list under the hood with O(n) lookup time by index, making the `takesFilesUntilMax` method an O(n^2) operation. Simply iterating over the list instead makes it O(n) overall.

Does this PR introduce any user-facing change?
No, just performance.
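For illustration, the shape of the problem and the fix can be sketched in Scala. This is a minimal sketch, not the actual Spark code: `FileEntry`, the method names, and the admission rule (a file is taken only while the byte budget is not exceeded) are all hypothetical.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for Spark's internal file metadata.
case class FileEntry(path: String, size: Long)

// Before: files(i) on an immutable.List walks the linked list from the
// head on every access, so this loop is O(n) per lookup and O(n^2) overall.
def takeFilesUntilMaxByIndex(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = ListBuffer.empty[FileEntry]
  var bytes = 0L
  var i = 0
  while (i < files.length && bytes + files(i).size <= maxBytes) {
    bytes += files(i).size
    selected += files(i)
    i += 1
  }
  selected.toList
}

// After: a single forward pass with an iterator, O(n) overall.
def takeFilesUntilMaxByIteration(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = ListBuffer.empty[FileEntry]
  var bytes = 0L
  val it = files.iterator
  var done = false
  while (it.hasNext && !done) {
    val f = it.next()
    if (bytes + f.size <= maxBytes) {
      bytes += f.size
      selected += f
    } else {
      done = true
    }
  }
  selected.toList
}
```

Both versions select the same prefix of files; only the traversal strategy differs, which is why the fix is purely a performance change with no behavioral impact.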
How was this patch tested?
No new tests, purely a performance improvement.
Was this patch authored or co-authored using generative AI tooling?
No