@prashantwason
Member

Describe the issue this Pull Request addresses

This PR optimizes the bloom index key lookup by storing candidate record keys in a Set instead of an ArrayList. An ArrayList that grows beyond its initially allocated size must reallocate and copy its backing array, which adds memory overhead. A Set is better suited to an existence check and avoids the need to copy the collection when calling filterRowKeys().

Summary and Changelog

  • Changed candidateRecordKeys in HoodieKeyLookupHandle from ArrayList<String> to HashSet<String>
  • Updated filterKeysFromFile method signature in HoodieIndexUtils to accept Set<String> instead of List<String>
  • Removed unnecessary .stream().collect(Collectors.toSet()) call since the input is already a Set
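
The change listed above can be illustrated with a minimal, self-contained sketch. This is not the Hudi code: the class `CandidateKeyLookupSketch` and the methods `filterFromList` and `filterFromSet` are hypothetical names chosen for illustration, and the real `HoodieKeyLookupHandle` and `HoodieIndexUtils` signatures may differ. It only assumes what the summary states, namely that the old path converted a list of candidate keys to a set before checking membership.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CandidateKeyLookupSketch {

  // "Before" shape (hypothetical): candidates arrive as a List, so a Set copy
  // is built first to make each membership check a hash lookup.
  public static List<String> filterFromList(List<String> candidateKeys, List<String> keysInFile) {
    Set<String> lookup = candidateKeys.stream().collect(Collectors.toSet());
    return keysInFile.stream().filter(lookup::contains).collect(Collectors.toList());
  }

  // "After" shape (hypothetical): candidates are already a Set, so the copy
  // step disappears and the set is queried directly.
  public static List<String> filterFromSet(Set<String> candidateKeys, List<String> keysInFile) {
    return keysInFile.stream().filter(candidateKeys::contains).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Stand-in for the record keys actually present in a data file.
    List<String> keysInFile = Arrays.asList("key1", "key2", "key3");

    // The same candidates, held once as a list (with a duplicate) and once as a set.
    List<String> asList = new ArrayList<>(Arrays.asList("key2", "key9", "key2"));
    Set<String> asSet = new HashSet<>(asList);

    System.out.println(filterFromList(asList, keysInFile));
    System.out.println(filterFromSet(asSet, keysInFile));
  }
}
```

Both calls print `[key2]`. The set-based variant collapses the duplicate candidate when it is added, which does not affect the result of an existence check.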

Impact

No public API changes. This is an internal optimization that reduces memory overhead when looking up a large number of keys during bloom index operations.

Risk Level

low - This is a straightforward type change from ArrayList to HashSet with no behavioral changes. The Set semantics are actually more appropriate since we're checking for key existence.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…ge number of keys.

ArrayList has large memory overhead which occurs when the ArrayList grows beyond its initially allocated size. Set is better suited to an exists check.
@hudi-bot
Collaborator

hudi-bot commented Jan 4, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Jan 4, 2026
