Skip to content

Commit 07fb33f

Browse files
authored
Merge branch 'main' into Support-CallBack-On-Step-End
2 parents e21f9c1 + 69f919d commit 07fb33f

File tree

51 files changed

+1347
-149
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

51 files changed

+1347
-149
lines changed

docs/source/en/api/utilities.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,3 +45,7 @@ Utility and helper functions for working with 🤗 Diffusers.
4545
## apply_layerwise_casting
4646

4747
[[autodoc]] hooks.layerwise_casting.apply_layerwise_casting
48+
49+
## apply_group_offloading
50+
51+
[[autodoc]] hooks.group_offloading.apply_group_offloading

docs/source/en/optimization/memory.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -158,6 +158,46 @@ In order to properly offload models after they're called, it is required to run
158158

159159
</Tip>
160160

161+
## Group offloading
162+
163+
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either `torch.nn.ModuleList` or `torch.nn.Sequential`), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
164+
165+
To enable group offloading, call the [`~ModelMixin.enable_group_offload`] method on the model if it is a Diffusers model implementation. For any other model implementation, use [`~hooks.group_offloading.apply_group_offloading`]:
166+
167+
```python
168+
import torch
169+
from diffusers import CogVideoXPipeline
170+
from diffusers.hooks import apply_group_offloading
171+
from diffusers.utils import export_to_video
172+
173+
# Load the pipeline
174+
onload_device = torch.device("cuda")
175+
offload_device = torch.device("cpu")
176+
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
177+
178+
# We can utilize the enable_group_offload method for Diffusers model implementations
179+
pipe.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
180+
181+
# For any other model implementations, the apply_group_offloading function can be used
182+
apply_group_offloading(pipe.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
183+
apply_group_offloading(pipe.vae, onload_device=onload_device, offload_type="leaf_level")
184+
185+
prompt = (
186+
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
187+
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
188+
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
189+
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
190+
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
191+
"atmosphere of this unique musical performance."
192+
)
193+
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
194+
# This utilized about 14.79 GB. It can be further reduced by using tiling and using leaf_level offloading throughout the pipeline.
195+
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
196+
export_to_video(video, "output.mp4", fps=8)
197+
```
198+
199+
Group offloading (for CUDA devices with support for asynchronous data transfer streams) overlaps data transfer and computation to reduce the overall execution time compared to sequential offloading. This is enabled using layer prefetching with CUDA streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed - this increases the memory requirements slightly. Group offloading also supports leaf-level offloading (equivalent to sequential CPU offloading) but can be made much faster when using streams.
200+
161201
## FP8 layerwise weight-casting
162202

163203
PyTorch supports `torch.float8_e4m3fn` and `torch.float8_e5m2` as weight storage dtypes, but they can't be used for computation in many different tensor operations due to unimplemented kernel support. However, you can use these dtypes to store model weights in fp8 precision and upcast them on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting.

docs/source/en/training/custom_diffusion.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -339,7 +339,10 @@ import torch
339339
from huggingface_hub.repocard import RepoCard
340340
from diffusers import DiffusionPipeline
341341

342-
pipeline = DiffusionPipeline.from_pretrained("sayakpaul/custom-diffusion-cat-wooden-pot", torch_dtype=torch.float16).to("cuda")
342+
pipeline = DiffusionPipeline.from_pretrained(
343+
"CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16,
344+
).to("cuda")
345+
model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
343346
pipeline.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
344347
pipeline.load_textual_inversion(model_id, weight_name="<new1>.bin")
345348
pipeline.load_textual_inversion(model_id, weight_name="<new2>.bin")

src/diffusers/hooks/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22

33

44
if is_torch_available():
5+
from .group_offloading import apply_group_offloading
56
from .hooks import HookRegistry, ModelHook
67
from .layerwise_casting import apply_layerwise_casting, apply_layerwise_casting_hook
78
from .pyramid_attention_broadcast import PyramidAttentionBroadcastConfig, apply_pyramid_attention_broadcast

0 commit comments

Comments
 (0)