perf(index): Use Set instead of ArrayList to reduce memory overhead in key lookup #17774
+6
−6
Describe the issue this Pull Request addresses
This PR optimizes the bloom index key lookup by using Set instead of ArrayList for storing candidate record keys. An ArrayList carries significant memory overhead once it grows beyond its initially allocated capacity. A Set is better suited to an existence check and avoids the need to copy the collection when calling filterRowKeys().
Summary and Changelog
- Changed candidateRecordKeys in HoodieKeyLookupHandle from ArrayList<String> to HashSet<String>
- Updated the filterKeysFromFile method signature in HoodieIndexUtils to accept Set<String> instead of List<String>
- Removed the .stream().collect(Collectors.toSet()) call since the input is already a Set
Impact
No public API changes. This is an internal optimization that reduces memory overhead when looking up a large number of keys during bloom index operations.
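For reviewers who want a quick picture of the change, here is a minimal sketch of the before and after shapes. It is not the actual Hudi code: the class CandidateKeyLookupSketch and the methods filterWithList and filterWithSet are hypothetical names invented for illustration, and the real filterKeysFromFile reads keys from a file rather than taking them as a list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; names do not match the Hudi code base.
class CandidateKeyLookupSketch {

  // Before: candidates arrive as a List, so a second collection (a Set copy)
  // has to be built before membership checks can be done efficiently.
  static Set<String> filterWithList(List<String> candidateKeys, List<String> keysInFile) {
    Set<String> candidates = candidateKeys.stream().collect(Collectors.toSet());
    Set<String> found = new HashSet<>();
    for (String key : keysInFile) {
      if (candidates.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  // After: candidates are already a Set, so the copy step disappears and
  // contains() is used on the caller's collection directly.
  static Set<String> filterWithSet(Set<String> candidateKeys, List<String> keysInFile) {
    Set<String> found = new HashSet<>();
    for (String key : keysInFile) {
      if (candidateKeys.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    List<String> keysInFile = Arrays.asList("k1", "k2", "k3");
    Set<String> candidates = new HashSet<>(Arrays.asList("k2", "k9"));
    // Both variants return the same matches; only the intermediate copy differs.
    System.out.println(filterWithList(Arrays.asList("k2", "k9"), keysInFile));
    System.out.println(filterWithSet(candidates, keysInFile));
  }
}
```

Both variants return the same set of matching keys in this sketch. The difference is that the Set-based version never holds a List and a Set of the same candidates at the same time.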
Risk Level
low - This is a straightforward type change from ArrayList to HashSet with no behavioral changes. The Set semantics are actually more appropriate since we're checking for key existence.
Documentation Update
none
Contributor's checklist