fix: Destroy the previously created accumulator for AggregateWindow to fix memory leak #15679

Open
jiangjiangtian wants to merge 2 commits into facebookincubator:main from jiangjiangtian:destroy_value
Conversation

@jiangjiangtian jiangjiangtian commented Dec 3, 2025

One example I found is as follows: if the method initializeNewGroupsInternal in FirstLastAggregateBase is called multiple times and the function is not numeric, the previously created SingleValueAccumulator object is overwritten without calling destroy to free its memory. This leaks memory and eventually causes OOM. The situation occurs in the Window operator when there are multiple window frames to process.

So this PR fixes the issue by calling destroy on rawSingleGroupRow_ when processing a new window frame.
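
The leak pattern described above can be sketched with a toy model. This is only an illustration under stated assumptions: ToyAccumulator, initializeGroup, leakedBuffers, and the liveBuffers counter are hypothetical names, not Velox APIs, and a counter stands in for real heap allocations.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; none of these are Velox APIs.
static int liveBuffers = 0;  // number of simulated buffers not yet released

struct ToyAccumulator {
  bool ownsBuffer = false;

  // Simulates storing a non-numeric value out of line.
  void write() {
    if (!ownsBuffer) {
      ownsBuffer = true;
      ++liveBuffers;
    }
  }

  // Simulates releasing the out-of-line value.
  void destroy() {
    if (ownsBuffer) {
      ownsBuffer = false;
      --liveBuffers;
    }
  }
};

// Re-initialization overwrites the slot and forgets any buffer it owned.
inline void initializeGroup(ToyAccumulator* slot) {
  *slot = ToyAccumulator{};
}

// Processes `frames` window frames on one reused slot and returns the
// number of simulated buffers that were never released.
inline int leakedBuffers(int frames, bool destroyBeforeInit) {
  liveBuffers = 0;
  ToyAccumulator slot;
  for (int i = 0; i < frames; ++i) {
    if (destroyBeforeInit) {
      slot.destroy();
    }
    initializeGroup(&slot);
    slot.write();
  }
  slot.destroy();  // the final cleanup can release only the last buffer
  return liveBuffers;
}
```

In this sketch, leakedBuffers(3, false) reports two unreleased buffers, while leakedBuffers(3, true) reports none, which mirrors the effect of calling destroy before each re-initialization.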

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 3, 2025
netlify bot commented Dec 3, 2025

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit e6e7d0a
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69ae97ddc068270008cc3fa1
😎 Deploy Preview https://deploy-preview-15679--meta-velox.netlify.app
@jiangjiangtian jiangjiangtian changed the title fix: Destroy SingleValueAccumulator when there exists in FirstLastAggregateBase to fix memory leak fix: Destroy previous accumulator when group has been initialized to fix memory leak Dec 3, 2025
@jiangjiangtian jiangjiangtian changed the title fix: Destroy previous accumulator when group has been initialized to fix memory leak fix: Destroy previous created accumulator when group has been initialized to fix memory leak Dec 3, 2025
@jiangjiangtian jiangjiangtian changed the title fix: Destroy previous created accumulator when group has been initialized to fix memory leak fix: Destroy the previously created accumulator for AggregateWindow to fix memory leak Dec 3, 2025
@jinchengchenghh
Collaborator

I found this code, which calls destroy directly: https://github.com/facebookincubator/velox/pull/5632/files#diff-b9ed2eb619a914241f656abe59881d169602eed734af2df443f6480d611850aaR122-R124
Maybe your exception is caused by not calling destroy directly.

Also, maybe fix that in

  virtual void initializeNewGroups(
      char** groups,
      folly::Range<const vector_size_t*> indices) {
    initializeNewGroupsInternal(groups, indices);

    for (auto index : indices) {
      groups[index][initializedByte_] |= initializedMask_;
    }
  }

This can help eliminate the memory leak issue.
Also, does this problem only occur when isFixedSize() is false?

  virtual bool isFixedSize() const {
    return true;
  }

initializeNewGroupsInternal should only be called when that condition is satisfied.

@jiangjiangtian
Author

> I find this code https://github.com/facebookincubator/velox/pull/5632/files#diff-b9ed2eb619a914241f656abe59881d169602eed734af2df443f6480d611850aaR122-R124 to call destroy directly, maybe your exception is caused by not calling the destroy directly.

Yes, currently AggregateWindow only creates a new accumulator without destroying the previous one.

> Also, maybe fix that in
>
>   virtual void initializeNewGroups(
>       char** groups,
>       folly::Range<const vector_size_t*> indices) {
>     initializeNewGroupsInternal(groups, indices);
>
>     for (auto index : indices) {
>       groups[index][initializedByte_] |= initializedMask_;
>     }
>   }
>
> This can help eliminate the memory leak issue.

Do you mean that if the group has already been initialized, this function should not call initializeNewGroupsInternal again? When we process a new window frame, we need to initialize the group again to clear the result of the previous window frame. That is why I call destroy in AggregateWindow.

> Also, does this problem only occurs when isFixedSize() false?

Maybe not. I found that NonNumericArbitrary does not override isFixedSize, but we still have to call destroy on it.

>   virtual bool isFixedSize() const {
>     return true;
>   }
>
> The initializeNewGroupsInternal should only be called when the condition is satisfied.

So this may not be correct.
Thanks for your reply.

@jinchengchenghh
Collaborator

I mean destroying the accumulator in initializeNewGroups rather than in initializeNewGroupsInternal, which would help with all group initialization issues. But it looks like the community prefers to call the destroy function directly, so let us align with the current solution: find one and solve one.
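
For comparison, the alternative mentioned in this comment (releasing an earlier accumulator inside initializeNewGroups when the group is already marked as initialized) might look roughly like the following toy sketch. ToyGroup, ToyAggregate, and liveAfterFrames are hypothetical names used only for illustration; this is not the Velox Aggregate class and not the approach taken in this PR.

```cpp
#include <cassert>
#include <vector>

// Toy model only: these types are hypothetical and are not Velox classes.
struct ToyGroup {
  bool initialized = false;     // stands in for the initialized bit
  bool hasAccumulator = false;  // stands in for an out-of-line accumulator
};

struct ToyAggregate {
  int liveAccumulators = 0;

  // Creates a fresh accumulator, overwriting whatever the group held.
  void initializeNewGroupsInternal(
      std::vector<ToyGroup>& groups,
      const std::vector<int>& indices) {
    for (int index : indices) {
      groups[index].hasAccumulator = true;
      ++liveAccumulators;
    }
  }

  // Releases the accumulator and resets the initialized flag.
  void destroy(ToyGroup& group) {
    if (group.hasAccumulator) {
      group.hasAccumulator = false;
      --liveAccumulators;
    }
    group.initialized = false;
  }

  // Guarded variant: release any earlier accumulator before re-initializing.
  void initializeNewGroups(
      std::vector<ToyGroup>& groups,
      const std::vector<int>& indices) {
    for (int index : indices) {
      if (groups[index].initialized) {
        destroy(groups[index]);
      }
    }
    initializeNewGroupsInternal(groups, indices);
    for (int index : indices) {
      groups[index].initialized = true;
    }
  }
};

// Re-initializes group 0 once per frame and reports live accumulators.
inline int liveAfterFrames(int frames) {
  ToyAggregate aggregate;
  std::vector<ToyGroup> groups(1);
  const std::vector<int> indices{0};
  for (int i = 0; i < frames; ++i) {
    aggregate.initializeNewGroups(groups, indices);
  }
  return aggregate.liveAccumulators;
}
```

With the guard in place, repeated initialization of the same group keeps at most one live accumulator in this model. The trade-off is an extra flag check on every initialization, whereas calling destroy at each call site keeps the check out of the common path.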

@jinchengchenghh
Collaborator

In code and in review, we need to be careful when calling initializeNewGroups and decide whether destroy should be called first.

@jinchengchenghh jinchengchenghh self-requested a review December 3, 2025 12:22
Collaborator

@jinchengchenghh jinchengchenghh left a comment

Thanks for your fix!

@jiangjiangtian
Author

@mbasmanova @xiaoxmeng @Yuhta Could you please help review this PR?

// aggregate_ function object should be initialized.
auto singleGroup = std::vector<vector_size_t>{0};
aggregate_->clear();
aggregate_->destroy(folly::Range<char**>(&rawSingleGroupRow_, 1));
Contributor

Do we still need to call 'clear' if we are calling 'destroy'?

Author

Maybe we do. clear resets some state of the aggregate function, and we need to reset that state when we start processing a new window frame.

Collaborator

@jiangjiangtian @mbasmanova I too feel it's confusing to call both of these. Why does clearing an aggregate not imply destroying all the groups?

Author

Shall we add a call to destroy in clear?

Author

@mbasmanova @aditi-pandit What do you think? Should we add a call to destroy in clear?

Contributor

@jiangjiangtian clear() is clearing the state, but doesn't free memory. Similar to std::vector::clear. destroy() frees up memory (and also clears the state). Similar to std::vector's destructor. I would imagine it is sufficient to call destroy(). Calling clear() before destroy() is redundant. Is this not the case?
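
The std::vector analogy in this comment can be checked with a small standalone sketch. The helper name sizeAndCapacityAfterClear is hypothetical; the only library behavior relied on is that std::vector::clear removes the elements while leaving the allocated capacity in place, so the storage is released only when the vector is destroyed (or explicitly shrunk).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {size, capacity} of a vector of `n` ints after calling clear().
inline std::pair<std::size_t, std::size_t> sizeAndCapacityAfterClear(
    std::size_t n) {
  std::vector<int> values(n, 0);
  values.clear();  // elements are gone, the allocation is kept
  return {values.size(), values.capacity()};
  // The storage is released here, when `values` is destroyed.
}
```

For example, sizeAndCapacityAfterClear(1000) returns a size of 0 together with a capacity of at least 1000.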

Author

@mbasmanova Currently, destroy does not clear the state of aggregate functions. So should we call clear() in destroy()?

Contributor

> destroy won't clear the states

That's strange. Why would we need to explicitly clear the state if we are releasing the memory used by it? Perhaps there are bugs in some implementations of destroy. What happens when you remove the 'clear' call before 'destroy'?

Author

> What happens when you remove 'clear' call before 'destroy'?

Sorry, I have not tried it yet. But I found that 'clear' just sets the null count in Aggregate to 0, and the null count is only used to decide whether we actually need to extract bits from 'group' when we check whether the 'group' is null or when we want to clear the null bits of the 'group'.
After reading some code, I realized that we have to call clearNull in destroy. Before we destroy the accumulator, we need to check whether the group is null. If it is, we need to decrement the null count in Aggregate so that it stays aligned with the number of null groups allocated. We don't perform this check now, so maybe that's the bug in destroy.
Based on the investigation above, I agree with you that the 'clear' call is confusing. If we call clearNull in destroy, we can resolve the confusion. What do you think? Thanks! cc @mbasmanova
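
The null-count bookkeeping described in this comment can be illustrated with a toy model. Everything here is an assumption made for illustration: ToyNullGroup, ToyNullTracker, and numNullsAfterFrames are hypothetical names, and the model assumes that initialization marks the group as null and increments the count, which may not match every Velox aggregate.

```cpp
#include <cassert>

// Toy model only: hypothetical types, not the Velox Aggregate class.
struct ToyNullGroup {
  bool isNull = false;
};

struct ToyNullTracker {
  int numNulls = 0;  // intended to equal the number of groups that are null

  // Assumption of this sketch: a freshly initialized group starts as null.
  void initialize(ToyNullGroup& group) {
    group.isNull = true;
    ++numNulls;
  }

  void clearNull(ToyNullGroup& group) {
    if (group.isNull) {
      group.isNull = false;
      --numNulls;
    }
  }

  // destroy either keeps the count in sync via clearNull or leaves it stale.
  void destroy(ToyNullGroup& group, bool clearNullInDestroy) {
    if (clearNullInDestroy) {
      clearNull(group);
    }
    // Accumulator memory would be released here.
  }
};

// Re-initializes and destroys one group per frame; returns the null count.
inline int numNullsAfterFrames(int frames, bool clearNullInDestroy) {
  ToyNullTracker tracker;
  ToyNullGroup group;
  for (int i = 0; i < frames; ++i) {
    tracker.initialize(group);
    tracker.destroy(group, clearNullInDestroy);
  }
  return tracker.numNulls;
}
```

In this model the count drifts to 3 after three frames when destroy skips clearNull, even though only one group exists, and stays at 0 when destroy calls clearNull. That drift is what an extra clear() call would otherwise paper over by resetting the count to 0.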

Contributor

@jiangjiangtian Yes, I believe you are correct that destroy needs to clear the null flag.

Contributor

@mbasmanova mbasmanova left a comment

Wondering if it would be possible to reproduce this issue with a simple test and verify it is fixed?

@mbasmanova
Contributor

CC: @jiangjiangtian

@jiangjiangtian jiangjiangtian force-pushed the destroy_value branch 2 times, most recently from cf703b3 to 56d83c9 Compare March 4, 2026 03:53
@jiangjiangtian
Author

@mbasmanova I have reworked this PR on top of #15680. Please take another look, thanks!

Contributor

@mbasmanova mbasmanova left a comment

@jiangjiangtian Would you confirm that the test fails without the changes?

@mbasmanova
Contributor

/claude-review

github-actions bot commented Mar 6, 2026

✅ Claude Code Review

Requested by @mbasmanova


Summary

This PR addresses a legitimate memory leak in the AggregateWindow operator by properly destroying previously created accumulators before reinitializing them. The fix is correct and necessary for memory safety.

Issues Found

🟢 Positive: Fix addresses real memory leak

The PR correctly identifies and fixes a memory leak where SingleValueAccumulator objects were being overwritten without proper destruction in window operations. The fix ensures destroy() is called before initializeNewGroups().

🟢 Positive: Proper API usage

The change from aggregate_->clear() to aggregate_->destroy() is the correct approach:

  • clear() only resets internal counters (numNulls_ etc.) but doesn't free accumulator memory
  • destroy() properly calls destroyInternal() which handles cleanup of non-numeric accumulators

🟢 Positive: Test demonstrates fix effectiveness

The test changes show the memory capacity was reduced from 256MB to 192MB, demonstrating that the fix actually reduces memory usage as expected.

🟢 Positive: Thread safety maintained

The changes don't introduce any concurrency issues. The rawSingleGroupRow_ is accessed consistently within the same thread context.

🟢 Positive: Consistent fix across code paths

Both incremental and non-incremental aggregation code paths in AggregateWindow.cpp were updated consistently (lines 153→destroy and 339→destroy).

Edge Cases Examined

  1. Null handling: ✅ The added clearNull(group) in the destroy method ensures null state is properly reset
  2. Multiple reinitializations: ✅ The fix works correctly when initializeNewGroupsInternal is called multiple times
  3. Memory pressure scenarios: ✅ Test validates behavior under constrained memory
  4. Exception safety: ✅ If initializeNewGroups throws after destroy, the state remains consistent

Performance Impact

The performance impact should be minimal since:

  • destroy() was already being called in other code paths
  • The operations are O(1) per group
  • Memory savings likely outweigh any marginal cost increase

Correctness Verification

I traced through the logic:

  1. destroy() calls destroyInternal() which properly frees SingleValueAccumulator memory for non-numeric types
  2. Resets the initialized flag (group[initializedByte_] &= ~initializedMask_)
  3. Calls clearNull() to reset null state
  4. initializeNewGroups() can then safely create new accumulators

The fix is correct and complete. The memory leak is properly addressed without introducing new issues.


ℹ️ About this response

This review was generated by Claude Code. It analyzed the PR diff and codebase to provide feedback.

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Available commands:

  • /claude-review - Request a code review
  • /claude-query <question> - Ask a question about the PR or codebase

@jiangjiangtian
Author

jiangjiangtian commented Mar 9, 2026

> @jiangjiangtian Would you confirm that the test fails without the changes?

@mbasmanova Yes! The test will fail if I revert this PR or #15680.

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants