Streamline non-cached state backed iterable. #34746

Merged
robertwb merged 3 commits into apache:master from
robertwb:gbk-iterable-streamlining
May 7, 2025

@robertwb
Contributor

@robertwb robertwb commented Apr 25, 2025

Avoids all the bookkeeping logic of creating and caching blocks of elements. In addition to reducing total work and simplifying the code paths, this allows elements to be decoded and consumed one at a time rather than stored in large lists even when actual caching is disabled, which should play much better with the garbage collector.

Note that this disables caching of the tail of GBK iterables. This will result in more side input reads and possible performance degradation for those GBKs whose value iterables are re-iterated and are too large to fit over the Data API but small enough to be substantially cached. IMHO this is a reasonable tradeoff because (1) large values like this are the antithesis of what one wants to place in the cache (they can evict all other values over the course of iteration) and (2) this provides increased performance (and in many cases avoids outright failure, e.g. due to GC memory thrashing) for the common use case of no re-iteration (including writing lots of data to fixed shards/dynamic destinations). The most common case of re-iteration (CoGBK) does its own caching anyway.
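For readers who want a concrete picture of the tradeoff, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Beam's classes: `LazyPagedIterable`, `fetchPage`, and the `fetches` counter are hypothetical stand-ins for an uncached state-backed iterable, a state request, and a way to observe how many requests were made. It decodes one element at a time and retains nothing, so a second pass has to fetch every page again.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an uncached, state-backed iterable; not Beam's actual class. */
class LazyPagedIterable implements Iterable<String> {
  // Stands in for the remote state service: each inner list is one chunk of encoded elements.
  private final List<List<byte[]>> pages;
  // Counts simulated state requests so the cost of re-iteration is visible.
  final AtomicInteger fetches = new AtomicInteger();

  LazyPagedIterable(List<List<byte[]>> pages) {
    this.pages = pages;
  }

  // Simulates a single state request that returns one chunk.
  private List<byte[]> fetchPage(int index) {
    fetches.incrementAndGet();
    return pages.get(index);
  }

  @Override
  public Iterator<String> iterator() {
    return new Iterator<String>() {
      private int nextPage = 0;
      private Iterator<byte[]> current = null;

      @Override
      public boolean hasNext() {
        // Fetch the next chunk only once the current one is exhausted.
        while ((current == null || !current.hasNext()) && nextPage < pages.size()) {
          current = fetchPage(nextPage++).iterator();
        }
        return current != null && current.hasNext();
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        // Decode a single element on demand; no decoded element is retained.
        return new String(current.next(), StandardCharsets.UTF_8);
      }
    };
  }

  // Helper that encodes strings into one chunk.
  static List<byte[]> page(String... values) {
    List<byte[]> encoded = new ArrayList<>();
    for (String value : values) {
      encoded.add(value.getBytes(StandardCharsets.UTF_8));
    }
    return encoded;
  }
}
```

In this sketch a single pass over two pages issues two fetches and a second pass issues two more, which mirrors the extra reads described above for re-iterated values. A consumer that iterates once never holds more than the current chunk's encoded bytes plus one decoded element.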



Avoids all the bookkeeping logic of creating and caching blocks of elements,
which in addition allows decoded elements to be ephemerally decoded and
consumed one at a time rather than storing them in large lists even when
actual caching is disabled.
@robertwb
Contributor Author

FYI @priyansndesai

@github-actions github-actions bot added the java label Apr 25, 2025
@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

return new UncachedStateIterable<>(beamFnStateClient, stateRequestForFirstChunk, valueCoder);
}

static class UncachedStateIterable<T> extends PrefetchableIterables.Default<T> {
Contributor Author

If this isn't used for state anywhere (I don't think it is) we probably don't even need to use the Prefetchable* interfaces at all, simplifying things further.

Contributor

Contributor Author

I meant to refer to the StateBackedIterable class itself. StateFetchingIterators are definitely used elsewhere.

Contributor

Sorry, still not sure I'm understanding. Are you suggesting just inlining this class in StateBackedIterable to avoid having it use the interfaces?

Contributor Author

I meant that, e.g., we could probably change the declaration at https://github.com/apache/beam/blob/release-2.60.0/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/StateBackedIterable.java#L77 to be just an Iterable, change the constructor for StateBackedIterable[Coder] to not take a Cache at all, etc. But it's also not too much work to keep it as a Prefetchable iterable given that the underlying iterable is one.


@robertwb robertwb force-pushed the gbk-iterable-streamlining branch from 3484504 to 877f111 Compare May 6, 2025 22:50
@scwhittle
Contributor

Run Java PreCommit

@scwhittle
Contributor

The BoundedQueueExecutor test failure is an unrelated flake. The Spark test failures involved side inputs, so I'm not sure whether they are related; rerunning tests.

Contributor

@scwhittle scwhittle left a comment

Looks good as long as test failure is unrelated

@priyansndesai
Contributor

Do we want to capability protect this?

@robertwb
Contributor Author

robertwb commented May 7, 2025

> Do we want to capability protect this?

This isn't really a capability--the interaction with the runner remains the same. I thought about guarding this with an experiment, but there isn't really a good way to plumb it down.

@robertwb robertwb merged commit 5b862dd into apache:master May 7, 2025
17 checks passed