docs/source/en/optimization/memory.md
+3 lines changed: 3 additions & 0 deletions
@@ -303,6 +303,9 @@ The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage
 ## Layerwise casting
 
+> [!TIP]
+> Combine layerwise casting with [group offloading](#group-offloading) for even more memory savings.
+
 Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
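For reference, layerwise casting is typically enabled with a short snippet like the following — a minimal sketch assuming the `enable_layerwise_casting` helper documented on this page; the model ID and prompt are illustrative:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Store transformer weights in fp8 and upcast to bf16 per layer for computation.
# Normalization/modulation layers are skipped to preserve generation quality.
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

image = pipeline("a photo of a cat").images[0]
```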
docs/source/en/optimization/speed-memory-optims.md
+19 -17 lines changed: 19 additions & 17 deletions
@@ -20,15 +20,14 @@ For image generation, combining quantization and [model offloading](./memory#model-offloading)
 For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound.
 
-The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage for Flux and Wan.
+The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage for Flux.
-| quantization, torch.compile, model CPU offloading (Flux) | 32.312 | 12.2369 |
-| quantization, torch.compile, group offloading (Wan) | | |
+| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
 
-<small>These results are benchmarked on Flux and Wan with a RTX 4090. The `transformer` and `text_encoder` components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>
+<small>These results are benchmarked on Flux with a RTX 4090. The `transformer` and `text_encoder` components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>
 
 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
@@ -40,14 +39,14 @@ pip install -U bitsandbytes
 Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference.
 
-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) cache size to allow recompiling up to a limit in case some guards fail.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bnb models with `fullgraph=True`.
 
 ```py
 import torch
 from diffusers import DiffusionPipeline
 from diffusers.quantizers import PipelineQuantizationConfig
 ```
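Putting these pieces together, a minimal end-to-end sketch might look like the following. The quantization backend, model ID, and compile call are illustrative assumptions, not the exact code from this guide:

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Let Dynamo capture dynamic output shapes so bnb models compile with fullgraph=True
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# Quantize the memory-heavy components to 4-bit with bitsandbytes
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder"],
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the denoiser; the first call triggers (slow) compilation
pipeline.transformer.compile(fullgraph=True)
```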
 In addition to quantization and torch.compile, try offloading if you need to reduce memory-usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations.
 
-Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) cache size to allow recompiling up to a limit in case some guards fail.
+Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation.
 
 "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"