fix: restore module buffers during stream-based group offload #13238
Closed
Chase-Xuu wants to merge 1 commit into huggingface:main from
Conversation
In `_offload_to_memory()`, when using CUDA streams, module buffers from `self.modules` were not being restored to their CPU copies. This created an asymmetry with `_build_cpu_param_dict()` and `_process_tensors_from_modules()` (onload), which both handle `group_module.buffers()`. The missing buffer restoration could cause:

- Stale buffer data on subsequent onload cycles
- Memory leaks (GPU tensors not released)
- Potential NaN values in models with stateful buffers (e.g., normalization layers) when used with `record_stream=True`

Fixes the stream path to match the non-stream path, which correctly moves all module state via `group_module.to()`.

Related: huggingface#12613

Signed-off-by: Chase Xu <chase_xu@outlook.com>
Signed-off-by: Chase Xu <80196056+Chase-Xuu@users.noreply.github.com>
Author
Closing to comply with one-PR-per-project policy. Will resubmit this buffer offload fix after #13240 is resolved. The stream-based group offload buffer issue is real; I'll track it separately.
Description
In `_offload_to_memory()`, when using CUDA streams (`self.stream is not None`), module buffers from `self.modules` were not being restored to their CPU tensor copies during offload. This created an asymmetry with both `_build_cpu_param_dict()` and `_process_tensors_from_modules()` (used during onload), which correctly iterate over `group_module.buffers()`.

The bug
| Method | `group_module.parameters()` | `group_module.buffers()` | `self.parameters` | `self.buffers` |
| --- | --- | --- | --- | --- |
| `_build_cpu_param_dict` | ✅ | ✅ | ✅ | ✅ |
| `_process_tensors_from_modules` (onload) | ✅ | ✅ | ✅ | ✅ |
| `_offload_to_memory` (stream path) | ✅ | ❌ | ✅ | ✅ |
| `_offload_to_memory` (non-stream path) | ✅ (`.to()`) | ✅ (`.to()`) | ✅ | ✅ |

The non-stream path uses `group_module.to(self.offload_device)`, which correctly moves all parameters and buffers. The stream path manually iterates but was missing the buffer loop.

Impact
- Stale buffer data on subsequent onload cycles
- Buffers (e.g., `running_mean`/`running_var` in normalization layers) remain on GPU after offload, leaking memory
- Potential NaN values when using `record_stream=True` with group offloading (related: wan 2.2 cause nan in latent in i2v #12613)
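The missing loop and its fix can be sketched as follows. This is a minimal, CPU-only illustration: the function and dictionary names are simplifications of the diffusers hook, not its exact implementation.

```python
import torch
import torch.nn as nn

def offload_to_memory_stream(group_module, cpu_tensor_dict):
    """Sketch of the stream-path offload AFTER the fix (illustrative names,
    not the exact diffusers code)."""
    # Parameters were already restored to their CPU copies before the fix.
    for param in group_module.parameters():
        param.data = cpu_tensor_dict[param]
    # The fix: restore buffers too, symmetric with the onload path.
    for buffer in group_module.buffers():
        buffer.data = cpu_tensor_dict[buffer]

# A layer with stateful buffers (running_mean / running_var).
norm = nn.BatchNorm2d(4)
cpu_copies = {t: t.data.clone() for t in [*norm.parameters(), *norm.buffers()]}

norm(torch.randn(2, 4, 8, 8))  # a forward pass mutates the running stats
assert not torch.equal(norm.running_mean, torch.zeros(4))

offload_to_memory_stream(norm, cpu_copies)
# Buffers now point back at their CPU copies instead of stale device memory.
assert torch.equal(norm.running_mean, torch.zeros(4))
```

Without the buffer loop, `norm.running_mean` would keep referencing the device-side tensor, which is exactly the stale-data and memory-leak hazard described above.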
Added the missing `group_module.buffers()` loop in the stream path of `_offload_to_memory()`, making it symmetric with the onload path.

Related Issues
- #12613 (NaN values when using `record_stream=True` in group offloading)