Calculate byte size via sampling in StateBackedIterable if size is not cheap to calculate#33780

Merged
scwhittle merged 9 commits into apache:master from stankiewicz:sample_reporting_size
Feb 4, 2025

Conversation

@stankiewicz
Contributor

For Dataflow V2, StateBackedIterable is iterated by readers after a GBK (GroupByKey) shuffle, for example by a ParDo after GBK or by merging combiners after GBK.

metrics.proto specifies that sampling is used, because calculating the byte count involves serializing the elements, which is CPU intensive.

In the case of StateBackedIterable, sampling does not occur, which hurts the performance of pipelines that have expensive coders.

This change introduces sampling of the byte size in StateBackedIterable when the size is not cheap to calculate.

Fully fixes #33620; the previous fix was only an improvement.
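
For illustration only, the sampling idea described above can be sketched roughly as follows. This is not the code in this PR: the class name `SampledByteSizeEstimator`, the `samplePeriod` parameter, and the "measure every Nth element" policy are all assumptions made for the example, and the size function merely stands in for an expensive coder.

```java
import java.util.function.ToLongFunction;

/**
 * Hypothetical helper, not a Beam class: estimates the total encoded size of
 * iterated elements while paying the encoding cost for only a sample of them.
 */
class SampledByteSizeEstimator<T> {
  // Stands in for an expensive coder-based size computation.
  private final ToLongFunction<T> encodedSizeFn;
  // Assumed policy: measure one element out of every samplePeriod.
  private final int samplePeriod;

  private long elementsSeen = 0;
  private long elementsSampled = 0;
  private long sampledBytes = 0;

  SampledByteSizeEstimator(ToLongFunction<T> encodedSizeFn, int samplePeriod) {
    if (samplePeriod < 1) {
      throw new IllegalArgumentException("samplePeriod must be >= 1");
    }
    this.encodedSizeFn = encodedSizeFn;
    this.samplePeriod = samplePeriod;
  }

  /** Call once per element as the iterable is consumed. */
  void observe(T element) {
    if (elementsSeen % samplePeriod == 0) {
      // Only sampled elements pay the measurement cost.
      sampledBytes += encodedSizeFn.applyAsLong(element);
      elementsSampled++;
    }
    elementsSeen++;
  }

  long sampledCount() {
    return elementsSampled;
  }

  /** Mean sampled size scaled up to every element seen so far. */
  long estimatedTotalBytes() {
    if (elementsSampled == 0) {
      return 0L;
    }
    double meanSize = (double) sampledBytes / elementsSampled;
    return Math.round(meanSize * elementsSeen);
  }
}
```

In this sketch, observing 100 elements of 4 bytes each with `samplePeriod = 10` invokes the size function 10 times and yields an estimate of 400 bytes. A fixed period is just the simplest policy to show; the actual sampling strategy used by the SDK may differ.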

@github-actions github-actions bot added the java label Jan 28, 2025
@stankiewicz stankiewicz marked this pull request as ready for review January 29, 2025 08:23
@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@stankiewicz
Contributor Author

Run Java_PVR_Flink_Docker PreCommit

@stankiewicz stankiewicz requested a review from scwhittle February 3, 2025 09:48
@stankiewicz stankiewicz requested a review from scwhittle February 3, 2025 19:23
@stankiewicz
Contributor Author

Run Java PreCommit

@scwhittle scwhittle changed the title from "Sample reporting as observer is sampling distribution" to "Calculate byte size via sampling in StateBackedIterable if size is not cheap to calculate" Feb 4, 2025
@scwhittle scwhittle merged commit 7356785 into apache:master Feb 4, 2025
17 checks passed
VardhanThigle pushed a commit to VardhanThigle/beam that referenced this pull request Mar 21, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: StateBackedIterable serializes elements size for every element when ComposedCombine is used

2 participants