Skip to content

[Java SDK] Warn when ValueState contains collection types#37530

Open
PDGGK wants to merge 3 commits intoapache:masterfrom
PDGGK:warn-valuestate-collection-clean
Open

[Java SDK] Warn when ValueState contains collection types#37530
PDGGK wants to merge 3 commits intoapache:masterfrom
PDGGK:warn-valuestate-collection-clean

Conversation

@PDGGK
Copy link
Contributor

@PDGGK PDGGK commented Feb 6, 2026

Summary

This PR adds a warning when users declare ValueState with collection types (Map, List, Set) that could benefit from using specialized state types for better performance.

Problem:
When users store collections in ValueState, the entire collection must be read and written on each access. This can cause significant performance issues for large collections.

Solution:
Log a warning during pipeline construction suggesting:

  • ValueState<Map> → Use MapState instead
  • ValueState<List> → Use BagState or OrderedListState instead
  • ValueState<Set> → Use SetState instead

Changes:

  • DoFnSignatures.java: Added warnIfValueStateContainsCollection() method that inspects state declarations and logs warnings for collection types
  • DoFnSignaturesTest.java: Added test cases to verify the warning logic works correctly

Fixes #36746

Test plan

  • Added tests for ValueState<Map>, ValueState<List>, ValueState<Set>
  • Added test for simple ValueState<String> (no warning expected)
  • Existing tests should continue to pass

🤖 Generated with Claude Code

When users declare ValueState<Map>, ValueState<List>, or ValueState<Set>,
log a warning suggesting they use MapState, BagState, or SetState instead.

Storing collections in ValueState requires reading and writing the entire
collection on each access, which can cause performance issues for large
collections. The specialized state types provide better performance.

Fixes apache#36746
@github-actions github-actions bot added the java label Feb 6, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @PDGGK, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential performance bottleneck in Apache Beam by introducing a warning mechanism. When users declare ValueState with Java collection types (like Map, List, or Set), the system will now log a warning during pipeline construction. This warning advises users to switch to more efficient, specialized state types (e.g., MapState, BagState) to avoid performance issues that arise from serializing and deserializing entire collections on each state access. This change aims to guide users towards more optimal state management patterns without breaking existing functionality.

Highlights

  • Warning for ValueState with Collection Types: A new warning mechanism has been introduced to detect when ValueState is declared with Java collection types such as Map, List, or Set.
  • Performance Recommendation: The warning advises users to switch to more performant, specialized state types like MapState, BagState, OrderedListState, or SetState to avoid performance issues associated with serializing/deserializing entire collections.
  • Pipeline Construction Integration: The warning is issued during pipeline construction, specifically when DoFn signatures are analyzed, providing early feedback to developers.
  • Test Coverage: New unit tests have been added to verify that warnings are correctly triggered for collection types and not for simple types, ensuring the new logic functions as expected.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/reflect/DoFnSignatures.java
    • Added SLF4J logging imports and initialized a Logger instance.
    • Implemented a new private static method warnIfValueStateContainsCollection to inspect ValueState declarations.
    • Integrated the warning logic into the analyzeStateDeclarations method, which now checks the generic type parameter of ValueState for Map, List, or Set and provides specific recommendations for specialized state types.
  • sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/reflect/DoFnSignaturesTest.java
    • Added new DoFn test classes (DoFnWithMapValueState, DoFnWithListValueState, DoFnWithSetValueState, DoFnWithSimpleValueState) to simulate various ValueState declarations.
    • Created corresponding test methods (testValueStateWithMapLogsWarning, testValueStateWithListLogsWarning, testValueStateWithSetLogsWarning, testValueStateWithSimpleTypeNoWarning) to verify that warnings are triggered correctly for collection types and not for simple types, while ensuring the signature parsing remains functional.
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

Assigning reviewers:

R: @kennknowles for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Eliaaazzz
Copy link
Contributor

Eliaaazzz commented Feb 6, 2026

Summary

This PR adds a warning when users declare ValueState with collection types (Map, List, Set) that could benefit from using specialized state types for better performance.

Problem: When users store collections in ValueState, the entire collection must be read and written on each access. This can cause significant performance issues for large collections.

Solution: Log a warning during pipeline construction suggesting:

  • ValueState<Map> → Use MapState instead
  • ValueState<List> → Use BagState or OrderedListState instead
  • ValueState<Set> → Use SetState instead

Changes:

  • DoFnSignatures.java: Added warnIfValueStateContainsCollection() method that inspects state declarations and logs warnings for collection types
  • DoFnSignaturesTest.java: Added test cases to verify the warning logic works correctly

Fixes #36746

Test plan

  • Added tests for ValueState<Map>, ValueState<List>, ValueState<Set>
  • Added test for simple ValueState<String> (no warning expected)
  • Existing tests should continue to pass

🤖 Generated with Claude Code

Hi Jason,this looks like a great move for scalability! Promoting BagState/MapState is definitely the right direction. Promoting BagState/MapState is definitely the right direction for scalability.

However, I'd like to bring up a small concern about potential false positives. In scenarios where the collection is small or the workflow requires atomic full overwrites (read-modify-write the entire list), ValueState<List> can sometimes be more performant than BagState because it involves a single RPC round-trip for serialization, avoiding the overhead of paging/iterators associated with BagState.

If we log a warning for all collection types, it might unintentionally lead users to refactor efficient code (for their specific small-data use case) into less efficient patterns.

Do you think it might be worth mentioning this trade-off in the log message, or perhaps restricting the warning to cases where we can detect partial updates (though I know that's hard to catch at graph construction time)? Just want to make sure we don't alarm users who are intentionally using ValueState for small atomic collections.

@PDGGK
Copy link
Contributor Author

PDGGK commented Feb 7, 2026

Thanks @Eliaaazzz, this is a very good point!

You're absolutely right that ValueState<List> can be preferable for:

  • Small collections: Single RPC round-trip, no iterator/paging overhead
  • Atomic full-overwrite patterns: Read-modify-write the entire list as a unit

I'll make the following changes:

  1. Downgrade to INFO: Change LOG.warn to LOG.info - this is a performance hint, not a correctness issue
  2. Refine message: Make the trade-off explicit:
    • Add: "This is appropriate for small collections or atomic replacement"
    • Keep: "For large collections or frequent appends, consider using..."
  3. Add runner caveat: Note that specialized state types (MapState/SetState) may not be supported by all runners

This way we guide new users without alarming experienced users who intentionally use this pattern for valid use cases.

Thanks for the thorough review!

PDGGK added 2 commits February 7, 2026 13:55
…e-offs

- Change LOG.warn to LOG.info (performance hint, not correctness issue)
- Clarify that ValueState is appropriate for small collections or atomic replacement
- Add runner compatibility caveat for specialized state types
- Address community feedback from @Eliaaazzz
- Use TypeDescriptor.resolveType() instead of raw Type manipulation
- Use hasUnresolvedParameters() instead of instanceof checks
- Use isSubtypeOf() for collection type detection
- Remove catch-all Exception block entirely

Addresses @kennknowles code review comments
@PDGGK
Copy link
Contributor Author

PDGGK commented Feb 7, 2026

Thanks for the review! I've addressed both comments:

  1. Removed catch-all block: No longer needed - using defensive TypeDescriptor API calls
  2. Using TypeDescriptor methods throughout:
    • resolveType() to extract ValueState's type parameter
    • hasUnresolvedParameters() instead of instanceof checks
    • isSubtypeOf() for collection type detection

This keeps all logic within the TypeDescriptor API as suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Warn when a user's ValueState looks like it could use a better state type

3 participants