[rollout] fix: memory release bug #5917
hustmf wants to merge 1 commit into verl-project:release/v0.7.1
Conversation
Code Review
This pull request modifies the weight transfer cleanup logic in bucketed_weight_transfer.py. The review feedback flags a potential NameError caused by removing the tensor initialization, and suggests a more effective way to release memory: call weights.clear() instead of iterating through the list and deleting local references.
```diff
  while True:
      metadata = self.socket.recv_pyobj()
-     weights, tensor = [], None
+     weights = []
```
Initializing tensor to None is recommended to ensure the variable is defined in the function scope. This prevents a potential NameError during the cleanup phase (line 261) if the bucket metadata is empty and the loop at line 243 never executes.
```diff
- weights = []
+ weights, tensor = [], None
```
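The failure mode the reviewer describes can be reproduced in isolation. This is a minimal sketch (with a hypothetical cleanup function, not the actual bucketed_weight_transfer.py code) showing why `tensor` must be bound before the loop: if the bucket metadata is empty and the loop body never runs, a later `del tensor` raises an unbound-name error.

```python
def cleanup_without_init(weights):
    # If `weights` is empty, the loop body never runs, `tensor` is never
    # bound, and the later `del tensor` raises UnboundLocalError
    # (a subclass of NameError).
    for _, tensor in weights:
        pass
    try:
        del tensor
        return "ok"
    except NameError:
        return "NameError"

def cleanup_with_init(weights):
    tensor = None  # bound even when the loop never executes
    for _, tensor in weights:
        pass
    del tensor
    return "ok"

print(cleanup_without_init([]))  # NameError
print(cleanup_with_init([]))     # ok
```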
```python
for _, tensor in weights:
    del tensor
weights.clear()
del weights
```
The loop iterating over weights to call del tensor is ineffective for releasing the objects in the list; it only unbinds the local loop variable on each iteration, while the list itself keeps a reference to every tuple. To properly release the tensors and address the memory issue, weights.clear() should be used. This ensures that even if external references to the weights list exist (e.g., within the callback), the tensors themselves can be garbage collected. A single del tensor afterwards is sufficient to drop the reference leaked from the previous loop.
```diff
- for _, tensor in weights:
-     del tensor
- weights.clear()
- del weights
+ weights.clear()
+ del weights, tensor
```
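The reviewer's point can be verified with weak references. This is a minimal sketch under CPython refcounting, using a hypothetical FakeTensor stand-in rather than real framework tensors: deleting the loop variable frees nothing, while clearing the list makes the objects collectible.

```python
import weakref

class FakeTensor:
    """Stand-in for a framework tensor (hypothetical)."""
    pass

def release_by_del_loop(weights):
    # Only unbinds the local loop variable on each iteration; the list
    # still references every (name, tensor) tuple, so nothing is freed.
    for _, tensor in weights:
        del tensor

def release_by_clear(weights):
    # Drops the list's own references; with no other holders, the
    # tensors become collectible immediately under CPython refcounting.
    weights.clear()

weights = [("layer0", FakeTensor()), ("layer1", FakeTensor())]
refs = [weakref.ref(t) for _, t in weights]

release_by_del_loop(weights)
print(all(r() is not None for r in refs))  # True: del in the loop freed nothing

release_by_clear(weights)
print(all(r() is None for r in refs))      # True: clear() released the tensors
```

The same reasoning explains the suggestion: clearing the list is what actually severs the references, and one trailing `del` cleans up the name leaked by Python's loop-variable scoping.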
What does this PR do?
The tensor variable introduced by PR https://github.com/verl-project/verl/pull/5173/changes can lead to a memory release bug: the vLLM worker keeps holding the tensor when sleep mode is enabled.
Checklist Before Starting
Title format: [{modules}] {type}: {description} (this will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, e.g. [megatron, fsdp, doc]
- {type} is one of feat, fix, refactor, chore, test
- If the PR is a breaking change, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always.
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR modifies the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.