From f18e04c3d61c284dba574060da41ad07b2386254 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Mon, 19 May 2025 12:14:05 +0530
Subject: [PATCH 1/2] tip for group offloading + quantization

Co-authored-by: Aryan VS
---
 docs/source/en/optimization/memory.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
index 5b3bfe650d74..8e4edbb89090 100644
--- a/docs/source/en/optimization/memory.md
+++ b/docs/source/en/optimization/memory.md
@@ -295,6 +295,13 @@ pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_d
 
 The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
 
+<Tip>
+
+The offloading strategies can be combined with [quantization](../quantization/overview.md) to enable further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) can often give the best trade-off between quality, speed, and memory. However, for video generation, as the models are more
+compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading benefits considerably from overlapping weight transfers and computation. When applying group offloading with quantization on image generation models at typical resolutions (1024x1024, for example), it usually cannot overlap weight transfer if the compute kernel finishes before weight transfer, making it communication bound between CPU/GPU.
+
+</Tip>
+
 ## Layerwise casting
 
 Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.

From 67505577c2cdd738d9039f8328ee35dd7b6c2e7c Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Mon, 19 May 2025 14:42:21 +0530
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: Aryan
---
 docs/source/en/optimization/memory.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
index 8e4edbb89090..6b853a7a084b 100644
--- a/docs/source/en/optimization/memory.md
+++ b/docs/source/en/optimization/memory.md
@@ -298,7 +298,7 @@ The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usag
 <Tip>
 
 The offloading strategies can be combined with [quantization](../quantization/overview.md) to enable further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) can often give the best trade-off between quality, speed, and memory. However, for video generation, as the models are more
-compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading benefits considerably from overlapping weight transfers and computation. When applying group offloading with quantization on image generation models at typical resolutions (1024x1024, for example), it usually cannot overlap weight transfer if the compute kernel finishes before weight transfer, making it communication bound between CPU/GPU.
+compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading provides considerable benefits when weight transfers can be overlapped with computation, which requires streams (`use_stream=True`). When applying group offloading with quantization to image generation models at typical resolutions (1024x1024, for example), the weight transfers usually cannot be *fully* overlapped because the compute kernels finish faster than the transfers, leaving the workload communication-bound between the CPU and GPU (due to device synchronizations).
 
 </Tip>
 
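For reference, a minimal sketch of the combination the tip describes: bitsandbytes 4-bit quantization of the transformer plus group offloading, following the `enable_group_offload`/`apply_group_offloading` pattern already shown earlier in memory.md. It assumes a Flux checkpoint and an installed bitsandbytes backend; the model ID, prompt, and parameter values are illustrative and not part of the patch.

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Quantize the transformer (the largest component) to 4-bit NF4 while loading.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)

# Group-offload the quantized transformer; use_stream=True asks diffusers to
# overlap weight transfers with computation, which is where group offloading helps most.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)

# Offload the other large components too, and keep the small CLIP encoder on the GPU.
apply_group_offloading(pipeline.text_encoder_2, onload_device=onload_device, offload_type="leaf_level")
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
pipeline.text_encoder.to(onload_device)

image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=28).images[0]
image.save("astronaut.png")
```

As the tip notes, at 1024x1024 the transformer's kernels may still outpace the CPU-to-GPU transfers, so a setup like this can end up communication-bound; the overlap pays off more for compute-bound video models.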