[Java SDK] Warn when ValueState contains collection types #37523

Closed
PDGGK wants to merge 4 commits into apache:master from PDGGK:warn-valuestate-collection

PDGGK commented Feb 5, 2026

Summary

This PR adds a warning when users declare ValueState with collection types (Map, List, Set) that could benefit from using specialized state types for better performance.

Problem:
When users store collections in ValueState, the entire collection must be read and written on each access. This can cause significant performance issues for large collections.
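
The cost difference can be illustrated with a toy model. This is only a sketch under stated assumptions: `WholeValueCell` is a hypothetical stand-in for a state cell that supports whole-value reads and writes only, and the "entries copied" counter is an illustrative proxy for serialization work, not a measurement of any Beam runner.

```java
import java.util.HashMap;
import java.util.Map;

final class WholeValueCostSketch {
  /** Hypothetical cell that can only be read or written as a whole value. */
  static final class WholeValueCell {
    private Map<String, Integer> stored = new HashMap<>();
    long entriesCopied = 0;

    Map<String, Integer> read() {
      entriesCopied += stored.size();
      return new HashMap<>(stored); // full copy models deserializing the whole map
    }

    void write(Map<String, Integer> value) {
      entriesCopied += value.size();
      stored = new HashMap<>(value); // full copy models serializing the whole map
    }
  }

  /** Entries copied when one key is updated in a whole-value cell holding n entries. */
  static long wholeValueUpdateCost(int n) {
    WholeValueCell cell = new WholeValueCell();
    Map<String, Integer> initial = new HashMap<>();
    for (int i = 0; i < n; i++) {
      initial.put("k" + i, 0);
    }
    cell.write(initial);
    cell.entriesCopied = 0; // do not count the preload

    Map<String, Integer> current = cell.read();
    current.merge("k0", 1, Integer::sum);
    cell.write(current);
    return cell.entriesCopied;
  }

  /** Entries touched when the same update goes through a per-key store. */
  static long perKeyUpdateCost(int n) {
    Map<String, Integer> store = new HashMap<>();
    for (int i = 0; i < n; i++) {
      store.put("k" + i, 0);
    }
    long touched = 0;
    store.merge("k0", 1, Integer::sum);
    touched++; // only the updated entry is touched
    return touched;
  }

  public static void main(String[] args) {
    System.out.println("whole-value cost: " + wholeValueUpdateCost(1000));
    System.out.println("per-key cost: " + perKeyUpdateCost(1000));
  }
}
```

In this model the whole-value cost grows with the size of the collection, while the per-key cost stays constant, which is the behavior the warning is meant to flag.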

Solution:
Log a warning during pipeline construction suggesting:

  • ValueState<Map> → Use MapState instead
  • ValueState<List> → Use BagState or OrderedListState instead
  • ValueState<Set> → Use SetState instead

Changes:

  • DoFnSignatures.java: Added warnIfValueStateContainsCollection() method that inspects state declarations and logs warnings for collection types
  • DoFnSignaturesTest.java: Added test cases to verify the warning logic works correctly
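
The body of warnIfValueStateContainsCollection() is not shown in this conversation, so the following is only a rough sketch of how such a check could classify a declared type using plain Java reflection. `ValueStateLike`, `Examples`, `suggestionFor`, and `suggestionForField` are hypothetical names introduced for illustration; they are not Beam APIs, and the real method may work differently (for example, it logs through SLF4J rather than returning a value).

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class ValueStateCollectionCheckSketch {
  /** Hypothetical stand-in for a value-state type; only its type parameter matters. */
  interface ValueStateLike<T> {}

  /** Example declarations whose generic types are inspected reflectively. */
  static final class Examples {
    ValueStateLike<Map<String, Integer>> mapState;
    ValueStateLike<List<String>> listState;
    ValueStateLike<Set<Long>> setState;
    ValueStateLike<String> stringState;
  }

  /** Returns the suggested state type if the declaration wraps a Map, List, or Set. */
  static Optional<String> suggestionFor(Type declared) {
    if (!(declared instanceof ParameterizedType)) {
      return Optional.empty();
    }
    ParameterizedType parameterized = (ParameterizedType) declared;
    if (!ValueStateLike.class.equals(parameterized.getRawType())) {
      return Optional.empty();
    }
    Class<?> raw = rawClassOf(parameterized.getActualTypeArguments()[0]);
    if (raw == null) {
      return Optional.empty();
    }
    if (Map.class.isAssignableFrom(raw)) {
      return Optional.of("MapState");
    }
    if (List.class.isAssignableFrom(raw)) {
      return Optional.of("BagState or OrderedListState");
    }
    if (Set.class.isAssignableFrom(raw)) {
      return Optional.of("SetState");
    }
    return Optional.empty();
  }

  /** Erases a type to its raw class, or returns null for wildcards and type variables. */
  private static Class<?> rawClassOf(Type type) {
    if (type instanceof Class) {
      return (Class<?>) type;
    }
    if (type instanceof ParameterizedType) {
      Type raw = ((ParameterizedType) type).getRawType();
      return raw instanceof Class ? (Class<?>) raw : null;
    }
    return null;
  }

  /** Convenience lookup against the Examples class above. */
  static Optional<String> suggestionForField(String fieldName) {
    try {
      return suggestionFor(Examples.class.getDeclaredField(fieldName).getGenericType());
    } catch (NoSuchFieldException e) {
      throw new IllegalArgumentException("No such example field: " + fieldName, e);
    }
  }

  public static void main(String[] args) {
    for (String name : new String[] {"mapState", "listState", "setState", "stringState"}) {
      System.out.println(name + " -> " + suggestionForField(name).orElse("no warning"));
    }
  }
}
```

Using isAssignableFrom on the raw class means subtypes such as HashMap or ArrayList would be flagged as well; whether the PR's check does the same is not stated here.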

Fixes #36746

Test plan

  • Added tests for ValueState<Map>, ValueState<List>, ValueState<Set>
  • Added test for simple ValueState<String> (no warning expected)
  • Existing tests should continue to pass

🤖 Generated with Claude Code

PDGGK added 3 commits February 5, 2026 22:03
This change fixes the validation logic in IcebergScanConfig to support
nested column paths using dot notation (e.g., "data.name").

Previously, the validation only checked top-level column names, causing
nested paths like "colA.colB" to fail with "unknown field(s)" error.

The fix uses Iceberg's Schema.findField() which natively resolves
dot-notation paths for nested fields.

Fixes apache#37486

- Use TypeUtil.indexByName() to enumerate all field paths
- Only select leaf fields to prevent parent struct from including dropped children
- Add test for nested drop validation

When users declare ValueState<Map>, ValueState<List>, or ValueState<Set>,
log a warning suggesting they use MapState, BagState, or SetState instead.

Storing collections in ValueState requires reading and writing the entire
collection on each access, which can cause performance issues for large
collections. The specialized state types provide better performance.

Fixes apache#36746

@gemini-code-assist

Summary of Changes

Hello @PDGGK, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves developer guidance in the Apache Beam Java SDK by introducing a warning for inefficient ValueState usage with collection types, promoting better performance through specialized state. Concurrently, it refines the Iceberg I/O connector's schema projection to correctly process nested column paths during field dropping, enhancing data filtering accuracy.

Highlights

  • ValueState Collection Warning: Introduced a new warning in DoFnSignatures to alert users when ValueState is declared with collection types (Map, List, Set), recommending specialized state types for performance.
  • Iceberg Nested Column Pruning: Enhanced Iceberg connector's schema projection to accurately handle dropping nested columns, ensuring only specified fields and their children are excluded.
  • New Unit Tests for ValueState Warnings: Added comprehensive unit tests for the new ValueState collection warning logic, covering various collection types and simple types.
  • New Tests for Iceberg Nested Column Pruning: Included new tests for Iceberg I/O to validate correct behavior of nested column pruning with both keep and drop configurations.

Changelog

  • sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/reflect/DoFnSignatures.java
    • Integrated SLF4J logging.
    • Added warnIfValueStateContainsCollection method to identify and log warnings for ValueState instances holding Map, List, or Set types, suggesting MapState, BagState/OrderedListState, or SetState respectively.
    • Invoked the new warning method during the analysis of state declarations.
  • sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/reflect/DoFnSignaturesTest.java
    • Introduced new DoFn classes for testing ValueState with Map, List, Set, and String types.
    • Added test methods (testValueStateWithMapLogsWarning, testValueStateWithListLogsWarning, testValueStateWithSetLogsWarning, testValueStateWithSimpleTypeNoWarning) to verify the warning mechanism's behavior.
  • sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergScanConfig.java
    • Updated the resolveSchema method to correctly handle drop configurations for nested fields. It now identifies leaf fields and removes them or their children based on the drop list, preventing unintended dropping of parent structs.
    • Modified the validate method to use table.schema().findField(name) for validating dropped fields, allowing support for nested column paths (e.g., "colA.colB").
  • sdks/java/io/iceberg/src/test/java/org/apache/beam/sdk/io/iceberg/IcebergIOReadTest.java
    • Added testNestedColumnPruningValidation to confirm that nested column paths are correctly handled when specified in the keep configuration.
    • Added testNestedColumnDropValidation to ensure that dropping nested fields works as expected, verifying that the specified nested field is removed from the projected schema.
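
The nested-path validation and leaf-only drop behavior described for IcebergScanConfig can be sketched without the Iceberg library. This is an assumption-laden illustration, not the PR's code: the schema is modeled as nested maps (a Map value is a struct, any other value is a leaf), and `fieldExists`, `leafPaths`, `keptLeavesAfterDrop`, and `exampleSchema` are hypothetical helpers that only mimic the roles the summary assigns to Schema.findField() and TypeUtil.indexByName().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestedDropSketch {
  /** True if the dot-notation path resolves to a struct or leaf in the toy schema. */
  static boolean fieldExists(Map<String, Object> schema, String path) {
    Object current = schema;
    for (String part : path.split("\\.")) {
      if (!(current instanceof Map)) {
        return false; // tried to descend into a leaf
      }
      Map<?, ?> struct = (Map<?, ?>) current;
      if (!struct.containsKey(part)) {
        return false;
      }
      current = struct.get(part);
    }
    return true;
  }

  /** Enumerates the full dot-notation path of every leaf field. */
  static List<String> leafPaths(Map<String, Object> schema) {
    List<String> out = new ArrayList<>();
    collect("", schema, out);
    return out;
  }

  private static void collect(String prefix, Map<?, ?> struct, List<String> out) {
    for (Map.Entry<?, ?> entry : struct.entrySet()) {
      String path = prefix.isEmpty() ? entry.getKey().toString() : prefix + "." + entry.getKey();
      if (entry.getValue() instanceof Map) {
        collect(path, (Map<?, ?>) entry.getValue(), out);
      } else {
        out.add(path);
      }
    }
  }

  /** Keeps every leaf that is neither dropped itself nor nested under a dropped path. */
  static List<String> keptLeavesAfterDrop(Map<String, Object> schema, List<String> drop) {
    List<String> kept = new ArrayList<>();
    for (String leaf : leafPaths(schema)) {
      boolean dropped = false;
      for (String d : drop) {
        if (leaf.equals(d) || leaf.startsWith(d + ".")) {
          dropped = true;
          break;
        }
      }
      if (!dropped) {
        kept.add(leaf);
      }
    }
    return kept;
  }

  /** Toy schema: id (leaf) and data (struct with leaves name and age). */
  static Map<String, Object> exampleSchema() {
    Map<String, Object> data = new LinkedHashMap<>();
    data.put("name", "string");
    data.put("age", "int");
    Map<String, Object> root = new LinkedHashMap<>();
    root.put("id", "long");
    root.put("data", data);
    return root;
  }

  public static void main(String[] args) {
    Map<String, Object> schema = exampleSchema();
    System.out.println(fieldExists(schema, "data.name"));
    System.out.println(keptLeavesAfterDrop(schema, Arrays.asList("data.name")));
  }
}
```

Selecting only leaf paths is what keeps a parent struct such as data from reintroducing a dropped child: the parent is never selected as a whole, only its surviving leaves are.
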

Activity

  • Tests were added for ValueState with Map, List, and Set collection types.
  • A test was included for ValueState with a simple String type, expecting no warning.
  • The author confirmed that existing tests should continue to pass.
  • The PR description indicates it was generated with Claude Code.

github-actions bot commented Feb 5, 2026

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

PDGGK commented Feb 6, 2026

Closing to recreate with clean branch (previous branch had unrelated Iceberg changes mixed in)

PDGGK closed this Feb 6, 2026
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Warn when a user's ValueState looks like it could use a better state type

1 participant