Conversation

Kimahriman
Contributor

What changes were proposed in this pull request?

Fixes a performance issue with the new maxBytesPerTrigger option for file stream sources introduced in #44636. When calculating offsets for a batch, the selected files are now consumed by iterating over the list rather than by indexing into it.

Why are the changes needed?

We tried out this new option and found that streams reading tables with a lot of files (in the millions) were spending hours constructing batches. Looking at the thread dump, I could see the files object was stored as a Scala immutable.List, which is a linked list under the hood with O(n) lookup time by index, making the takesFilesUntilMax method an O(n^2) operation. Instead, the list should simply be iterated over, which makes it an O(n) operation overall.
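To illustrate the difference, here is a minimal standalone sketch of the pattern described above. The names (FileEntry, the two take* methods, maxBytes) are hypothetical and for illustration only; this is not the actual Spark code.

```scala
// Hypothetical sketch; FileEntry and the method names below are illustrative,
// not the actual Spark implementation.
case class FileEntry(path: String, size: Long)

// Quadratic version: files(i) on an immutable.List walks the linked list from
// the head on every access, so the loop as a whole is O(n^2).
def takeUntilMaxBytesIndexed(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = scala.collection.mutable.ListBuffer.empty[FileEntry]
  var total = 0L
  var i = 0
  while (i < files.length && total + files(i).size <= maxBytes) {
    total += files(i).size
    selected += files(i)
    i += 1
  }
  selected.toList
}

// Linear version: a single pass with an iterator touches each element once,
// so the loop is O(n) overall.
def takeUntilMaxBytesIterated(files: List[FileEntry], maxBytes: Long): List[FileEntry] = {
  val selected = scala.collection.mutable.ListBuffer.empty[FileEntry]
  var total = 0L
  val it = files.iterator
  var done = false
  while (it.hasNext && !done) {
    val f = it.next()
    if (total + f.size <= maxBytes) {
      total += f.size
      selected += f
    } else {
      done = true
    }
  }
  selected.toList
}
```

Both versions select the same prefix of files whose cumulative size stays within the byte budget; only the traversal strategy differs, which is why the change is purely a performance improvement with no user-facing behavior change.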

Does this PR introduce any user-facing change?

No, just performance.

How was this patch tested?

No new tests; this is purely a performance improvement.

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman
Contributor Author

@MaxNevermind @viirya @dongjoon-hyun since you were the original creator/reviewers

@Kimahriman Kimahriman changed the title [SPARK-53797] maxBytesPerTrigger on file stream sources is incredibly slow [SPARK-53797] Improve performance of maxBytesPerTrigger for file stream sources Oct 3, 2025