Conversation

Kimahriman
Contributor

What changes were proposed in this pull request?

Fixes a performance issue with the new maxBytesPerTrigger option for file stream sources introduced in #44636. When calculating offsets for a batch, the selected files are now consumed by iterating over the list rather than by indexing into it.

Why are the changes needed?

We tried out this new option and found that streams reading tables with a lot of files (in the millions) were spending hours constructing batches. Looking at the thread dump, I could see the files object was stored as a Scala immutable.List, which is a linked list under the hood with O(n) lookup time by index, making the takesFilesUntilMax method an O(n^2) operation. Instead, the list should simply be iterated over, which makes it an O(n) operation overall.
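To illustrate the difference, here is a minimal standalone sketch of the pattern described above. The names (FileEntry, the two take* methods, maxBytes) are hypothetical and for illustration only; this is not the actual Spark code.

```scala
// Hypothetical sketch; FileEntry and the method names below are illustrative,
// not the actual Spark implementation.
case class FileEntry(path: String, size: Long)

// Quadratic version: files(i) on an immutable.List walks the linked list from
// the head on every access, so the loop as a whole is O(n^2).
def takeUntilMaxBytesIndexed(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = scala.collection.mutable.ListBuffer.empty[FileEntry]
  var total = 0L
  var i = 0
  while (i < files.length && total + files(i).size <= maxBytes) {
    total += files(i).size
    selected += files(i)
    i += 1
  }
  selected.toList
}

// Linear version: a single pass with an iterator touches each element once,
// so the loop is O(n) overall.
def takeUntilMaxBytesIterated(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = scala.collection.mutable.ListBuffer.empty[FileEntry]
  var total = 0L
  val it = files.iterator
  var done = false
  while (it.hasNext && !done) {
    val f = it.next()
    if (total + f.size <= maxBytes) {
      total += f.size
      selected += f
    } else {
      done = true
    }
  }
  selected.toList
}
```

Both versions select the same prefix of files whose cumulative size stays within the byte budget; only the traversal strategy differs, which is why the change is purely a performance improvement with no user-facing behavior change.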

Does this PR introduce any user-facing change?

No, just performance.

How was this patch tested?

No new tests; this is purely a performance improvement.

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman
Contributor Author

@MaxNevermind @viirya @dongjoon-hyun since you were the original creator/reviewers

@Kimahriman Kimahriman changed the title [SPARK-53797] maxBytesPerTrigger on file stream sources is incredibly slow [SPARK-53797] Improve performance of maxBytesPerTrigger for file stream sources Oct 3, 2025