fix: restore module buffers during stream-based group offload #13238

Closed
Chase-Xuu wants to merge 1 commit into huggingface:main from Chase-Xuu:fix/group-offload-missing-module-buffers

@Chase-Xuu

Description

In _offload_to_memory(), when using CUDA streams (self.stream is not None), module buffers from self.modules were not being restored to their CPU tensor copies during offload. This created an asymmetry with both _build_cpu_param_dict() and _process_tensors_from_modules() (used during onload), which correctly iterate over group_module.buffers().

The bug

| Method | group_module.parameters() | group_module.buffers() | self.parameters | self.buffers |
|---|---|---|---|---|
| _build_cpu_param_dict | ✅ | ✅ | | |
| _process_tensors_from_modules (onload) | ✅ | ✅ | | |
| _offload_to_memory (stream path) | ✅ | missing | | |
| _offload_to_memory (non-stream path) | ✅ (via .to()) | ✅ (via .to()) | | |

The non-stream path uses group_module.to(self.offload_device) which correctly moves all parameters and buffers. The stream path manually iterates but was missing the buffer loop.
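The asymmetry described above can be reproduced in a small toy model that needs neither PyTorch nor a GPU. This is only a sketch: `ToyTensor`, `ToyModule`, and the three helper functions are hypothetical stand-ins, not the diffusers implementation, and a "device" is just a string stored next to the values.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so tensors can be dict keys
class ToyTensor:
    """Hypothetical stand-in for a tensor: `data` is a (device, values) pair."""
    data: tuple


@dataclass
class ToyModule:
    """Hypothetical stand-in for a module that owns parameters and buffers."""
    params: list = field(default_factory=list)
    bufs: list = field(default_factory=list)

    def parameters(self):
        return iter(self.params)

    def buffers(self):
        return iter(self.bufs)


def build_cpu_param_dict(modules):
    # Remember the CPU copy of every parameter AND buffer, keyed by tensor identity.
    cpu = {}
    for m in modules:
        for t in list(m.parameters()) + list(m.buffers()):
            cpu[t] = t.data
    return cpu


def onload(modules):
    # Move every parameter and buffer to the pretend GPU.
    for m in modules:
        for t in list(m.parameters()) + list(m.buffers()):
            t.data = ("cuda", t.data[1])


def offload_stream_path_buggy(modules, cpu):
    # Mirrors the reported behavior: only parameters are restored, buffers are skipped.
    for m in modules:
        for p in m.parameters():
            p.data = cpu[p]


norm = ToyModule(params=[ToyTensor(("cpu", [1.0]))], bufs=[ToyTensor(("cpu", [0.0]))])
cpu_copies = build_cpu_param_dict([norm])
onload([norm])
offload_stream_path_buggy([norm], cpu_copies)

param_devices = [p.data[0] for p in norm.parameters()]
buffer_devices = [b.data[0] for b in norm.buffers()]
print(param_devices, buffer_devices)
```

After the buggy offload, the parameter is back on its CPU copy while the buffer still points at the pretend GPU storage, which is the situation the table summarizes.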

Impact

  • Module buffers (e.g., running_mean/running_var in normalization layers) remain on GPU after offload
  • On the next onload cycle, stale GPU buffer data may be used instead of the correct CPU copies
  • This could contribute to NaN values when using record_stream=True with group offloading (related: #12613, "wan 2.2 cause nan in latent in i2v")
  • Minor GPU memory leak from unreleased buffer tensors

Fix

Added the missing group_module.buffers() loop in the stream path of _offload_to_memory(), making it symmetric with the onload path.
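A minimal sketch of what the symmetric restore loop might look like, again on a PyTorch-free toy model. `Slot`, `ToyModule`, and `offload_stream_path` are hypothetical names, and placing the buffer loop inside the same per-module loop is an assumption; the actual diff is not shown in this conversation.

```python
class Slot:
    """Hypothetical stand-in for a tensor; only its `.data` attribute matters."""

    def __init__(self, data):
        self.data = data


class ToyModule:
    """Hypothetical stand-in for a module exposing parameters() and buffers()."""

    def __init__(self, params, bufs):
        self._params, self._bufs = list(params), list(bufs)

    def parameters(self):
        return iter(self._params)

    def buffers(self):
        return iter(self._bufs)


def offload_stream_path(modules, cpu_param_dict):
    """Point every module tensor back at its saved CPU copy."""
    for group_module in modules:
        for param in group_module.parameters():
            param.data = cpu_param_dict[param]
        # The loop the description says was missing from the stream path.
        for buffer in group_module.buffers():
            buffer.data = cpu_param_dict[buffer]


weight, running_mean = Slot("cpu:weight"), Slot("cpu:running_mean")
module = ToyModule([weight], [running_mean])

# Save CPU copies keyed by object identity, then pretend to onload.
cpu_param_dict = {t: t.data for t in (weight, running_mean)}
for t in (weight, running_mean):
    t.data = t.data.replace("cpu", "cuda")

offload_stream_path([module], cpu_param_dict)
restored = [weight.data, running_mean.data]
print(restored)
```

With both loops present, the parameter and the buffer are restored the same way, so the offload path covers the same tensors that the CPU dictionary and the onload path cover.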

Related Issues

In `_offload_to_memory()`, when using CUDA streams, module buffers from
`self.modules` were not being restored to their CPU copies. This created
an asymmetry with `_build_cpu_param_dict()` and
`_process_tensors_from_modules()` (onload), which both handle
`group_module.buffers()`.

The missing buffer restoration could cause:
- Stale buffer data on subsequent onload cycles
- Memory leaks (GPU tensors not released)
- Potential NaN values in models with stateful buffers (e.g., normalization
  layers) when used with `record_stream=True`

Fixes the stream path to match the non-stream path, which correctly moves
all module state via `group_module.to()`.

Related: huggingface#12613

Signed-off-by: Chase Xu <chase_xu@outlook.com>
Signed-off-by: Chase Xu <80196056+Chase-Xuu@users.noreply.github.com>
@Chase-Xuu
Author

Closing to comply with one-PR-per-project policy. Will resubmit this buffer offload fix after #13240 is resolved. The stream-based group offload buffer issue is real — I'll track it separately.

@Chase-Xuu Chase-Xuu closed this Mar 10, 2026