
Commit b5d5e99

feedback
1 parent b483f24 commit b5d5e99

2 files changed: +22 −17 lines changed


docs/source/en/optimization/memory.md

Lines changed: 3 additions & 0 deletions
@@ -303,6 +303,9 @@ The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usag

 ## Layerwise casting

+> [!TIP]
+> Combine layerwise casting with [group offloading](#group-offloading) for even more memory savings.
+
 Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.

 > [!WARNING]
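
The tip added above pairs two features documented in this file. As a minimal sketch of how they might combine (not part of this commit; the checkpoint name and dtypes are illustrative assumptions), layerwise casting and group offloading can both be enabled on a pipeline's transformer:

```py
# Minimal sketch, not from this commit: fp8 storage with bf16 compute, plus
# leaf-level group offloading as suggested by the new tip. The checkpoint and
# dtypes below are assumptions.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# store transformer weights in fp8 and upcast them to bf16 for computation
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# keep transformer weights on CPU and stream them back per leaf module
pipeline.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```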

docs/source/en/optimization/speed-memory-optims.md

Lines changed: 19 additions & 17 deletions
@@ -20,15 +20,14 @@ For image generation, combining quantization and [model offloading](./memory#mod

 For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound.

-The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage for Flux and Wan.
+The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage for Flux.

 | combination | latency (s) | memory-usage (GB) |
 |---|---|---|
-| quantization (Flux) | 32.602 | 14.9453 |
-| quantization, torch.compile (Flux) | 25.847 | 14.9448 |
-| quantization, torch.compile, model CPU offloading (Flux) | 32.312 | 12.2369 |
-| quantization, torch.compile, group offloading (Wan) | | |
-<small>These results are benchmarked on Flux and Wan with a RTX 4090. The `transformer` and `text_encoder` components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>
+| quantization | 32.602 | 14.9453 |
+| quantization, torch.compile | 25.847 | 14.9448 |
+| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
+<small>These results are benchmarked on Flux with a RTX 4090. The `transformer` and `text_encoder` components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>

 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.

@@ -40,14 +39,14 @@ pip install -U bitsandbytes

 Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.

-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) cache size to allow recompiling up to a limit in case some guards fail.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bnb models with `fullgraph=True`.

 ```py
 import torch
 from diffusers import DiffusionPipeline
 from diffusers.quantizers import PipelineQuantizationConfig

-torch._dynamo.config.cache_size_limit = 1000
+torch._dynamo.config.capture_dynamic_output_shape_ops = True

 # quantize
 pipeline_quant_config = PipelineQuantizationConfig(
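
The snippet in this hunk is cut off by the diff view. A fuller sketch of the quantize-then-compile flow it belongs to might look like the following; the quantization backend, kwargs, and checkpoint are assumptions for illustration, not part of this commit.

```py
# Hedged sketch of the surrounding flow; backend, kwargs, and checkpoint are
# assumptions rather than content of this commit.
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# let Dynamo trace the dynamic output shapes produced by bnb ops
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize the memory-heavy components with bitsandbytes 4-bit
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile the quantized transformer
pipeline.transformer.compile(fullgraph=True)
```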
@@ -75,7 +74,7 @@ pipeline("""

 In addition to quantization and torch.compile, try offloading if you need to reduce memory-usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations.

-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) cache size to allow recompiling up to a limit in case some guards fail.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation.

 <hfoptions id="offloading">
 <hfoption id="model CPU offloading">
@@ -106,7 +105,7 @@ pipeline.enable_model_cpu_offload()

 # compile
 pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.transformer.compile( mode="max-autotune", fullgraph=True)
+pipeline.transformer.compile()
 pipeline(
     "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
 ).images[0]
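
Read together with the previous hunk, the model CPU offloading path now raises the Dynamo cache size and drops to a plain `compile()` call. A condensed sketch of the resulting flow, assuming `pipeline` is the quantized pipeline loaded earlier in the guide:

```py
# Condensed sketch; assumes `pipeline` is the quantized pipeline from earlier.
import torch

# allow more recompilations, since offloading can trigger guard failures
torch._dynamo.config.cache_size_limit = 1000

# move whole components (transformer, text encoders, VAE) to CPU when idle
pipeline.enable_model_cpu_offload()

# compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile()

image = pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
).images[0]
```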
@@ -153,25 +152,28 @@ offload_device = torch.device("cpu")
 pipeline.transformer.enable_group_offload(
     onload_device=onload_device,
     offload_device=offload_device,
-    offload_type="block_level",
-    num_blocks_per_group=4
+    offload_type="leaf_level",
+    use_stream=True,
+    non_blocking=True
 )
 pipeline.vae.enable_group_offload(
     onload_device=onload_device,
     offload_device=offload_device,
-    offload_type="block_level",
-    num_blocks_per_group=4
+    offload_type="leaf_level",
+    use_stream=True,
+    non_blocking=True
 )
 apply_group_offloading(
     pipeline.text_encoder,
     onload_device=onload_device,
-    offload_type="block_level",
-    num_blocks_per_group=2
+    offload_type="leaf_level",
+    use_stream=True,
+    non_blocking=True
 )

 # compile
 pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.transformer.compile( mode="max-autotune", fullgraph=True)
+pipeline.transformer.compile()

 prompt = """
 The camera rushes from far to near in a low-angle shot,
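
For readability, the group-offloading block as it stands after this change is reassembled below from the `+` lines above. The surrounding pipeline setup is assumed from earlier in the guide, and the import path shown for `apply_group_offloading` is an assumption rather than part of this commit.

```py
# Reassembled from this hunk; assumes `pipeline` is already loaded and quantized.
import torch
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# stream weights on and off the GPU per leaf module, asynchronously
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)
pipeline.vae.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)
apply_group_offloading(
    pipeline.text_encoder,
    onload_device=onload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True
)

# compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile()
```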
