fix: restore module buffers during stream-based group offload #13238
Closed
Chase-Xuu wants to merge 1 commit into huggingface:main from
Conversation
In `_offload_to_memory()`, when using CUDA streams, module buffers from `self.modules` were not being restored to their CPU copies. This created an asymmetry with `_build_cpu_param_dict()` and `_process_tensors_from_modules()` (onload), which both handle `group_module.buffers()`. The missing buffer restoration could cause:

- Stale buffer data on subsequent onload cycles
- Memory leaks (GPU tensors not released)
- Potential NaN values in models with stateful buffers (e.g., normalization layers) when used with `record_stream=True`

Fixes the stream path to match the non-stream path, which correctly moves all module state via `group_module.to()`.

Related: huggingface#12613

Signed-off-by: Chase Xu <chase_xu@outlook.com>
Signed-off-by: Chase Xu <80196056+Chase-Xuu@users.noreply.github.com>
Author
Closing to comply with one-PR-per-project policy. Will resubmit this buffer offload fix after #13240 is resolved. The stream-based group offload buffer issue is real; I'll track it separately.
Description
In `_offload_to_memory()`, when using CUDA streams (`self.stream is not None`), module buffers from `self.modules` were not being restored to their CPU tensor copies during offload. This created an asymmetry with both `_build_cpu_param_dict()` and `_process_tensors_from_modules()` (used during onload), which correctly iterate over `group_module.buffers()`.

The bug
| Method | `group_module.parameters()` | `group_module.buffers()` | `self.parameters` | `self.buffers` |
| --- | --- | --- | --- | --- |
| `_build_cpu_param_dict` | ✅ | ✅ | ✅ | ✅ |
| `_process_tensors_from_modules` (onload) | ✅ | ✅ | ✅ | ✅ |
| `_offload_to_memory` (stream path) | ✅ | ❌ | ✅ | ✅ |
| `_offload_to_memory` (non-stream path) | ✅ (`.to()`) | ✅ (`.to()`) | ✅ | ✅ |

The non-stream path uses `group_module.to(self.offload_device)`, which correctly moves all parameters and buffers. The stream path manually iterates but was missing the buffer loop.

Impact
- Stale buffer data on subsequent onload cycles
- Buffers (e.g., `running_mean`/`running_var` in normalization layers) remain on GPU after offload, leaking memory
- Potential NaN values when using `record_stream=True` with group offloading (related: wan 2.2 cause nan in latent in i2v #12613)
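The missing loop and its fix can be sketched as follows. This is a minimal, CPU-only illustration: the function and dictionary names are simplifications of the diffusers hook, not its exact implementation.

```python
import torch
import torch.nn as nn

def offload_to_memory_stream(group_module, cpu_tensor_dict):
    """Sketch of the stream-path offload AFTER the fix (illustrative names,
    not the exact diffusers code)."""
    # Parameters were already restored to their CPU copies before the fix.
    for param in group_module.parameters():
        param.data = cpu_tensor_dict[param]
    # The fix: restore buffers too, symmetric with the onload path.
    for buffer in group_module.buffers():
        buffer.data = cpu_tensor_dict[buffer]

# A layer with stateful buffers (running_mean / running_var).
norm = nn.BatchNorm2d(4)
cpu_copies = {t: t.data.clone() for t in [*norm.parameters(), *norm.buffers()]}

norm(torch.randn(2, 4, 8, 8))  # a forward pass mutates the running stats
assert not torch.equal(norm.running_mean, torch.zeros(4))

offload_to_memory_stream(norm, cpu_copies)
# Buffers now point back at their CPU copies instead of stale device memory.
assert torch.equal(norm.running_mean, torch.zeros(4))
```

Without the buffer loop, `norm.running_mean` would keep referencing the device-side tensor, which is exactly the stale-data and memory-leak hazard described above.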
Added the missing `group_module.buffers()` loop in the stream path of `_offload_to_memory()`, making it symmetric with the onload path.

Related Issues
- #12613 (NaN values when using `record_stream=True` in group offloading)