From f18e04c3d61c284dba574060da41ad07b2386254 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Mon, 19 May 2025 12:14:05 +0530
Subject: [PATCH 1/2] tip for group offloading + quantization

Co-authored-by: Aryan VS
---
 docs/source/en/optimization/memory.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
index 5b3bfe650d74..8e4edbb89090 100644
--- a/docs/source/en/optimization/memory.md
+++ b/docs/source/en/optimization/memory.md
@@ -295,6 +295,13 @@ pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_d
 
 The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.
 
+<Tip>
+
+The offloading strategies can be combined with [quantization](../quantization/overview.md) to enable further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) can often give the best trade-off between quality, speed, and memory. However, for video generation, as the models are more
+compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading benefits considerably from overlapping weight transfers and computation. When applying group offloading with quantization on image generation models at typical resolutions (1024x1024, for example), it usually cannot overlap weight transfer if the compute kernel finishes before weight transfer, making it communication bound between CPU/GPU.
+
+</Tip>
+
 ## Layerwise casting
 
 Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.

From 67505577c2cdd738d9039f8328ee35dd7b6c2e7c Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Mon, 19 May 2025 14:42:21 +0530
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: Aryan
---
 docs/source/en/optimization/memory.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
index 8e4edbb89090..6b853a7a084b 100644
--- a/docs/source/en/optimization/memory.md
+++ b/docs/source/en/optimization/memory.md
@@ -298,7 +298,7 @@ The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usag
 <Tip>
 
 The offloading strategies can be combined with [quantization](../quantization/overview.md) to enable further memory savings. For image generation, combining [quantization and model offloading](#model-offloading) can often give the best trade-off between quality, speed, and memory. However, for video generation, as the models are more
-compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading benefits considerably from overlapping weight transfers and computation. When applying group offloading with quantization on image generation models at typical resolutions (1024x1024, for example), it usually cannot overlap weight transfer if the compute kernel finishes before weight transfer, making it communication bound between CPU/GPU.
+compute-bound, [group-offloading](#group-offloading) tends to be better. Group offloading provides considerable benefits when weight transfers can be overlapped with computation, which requires streams (`use_stream=True`). When applying group offloading with quantization to image generation models at typical resolutions (1024x1024, for example), the weight transfers usually cannot be *fully* overlapped because the compute kernels finish faster than the transfers, leaving the workload communication-bound between the CPU and GPU (due to device synchronizations).
 
 </Tip>
 
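For reference, a minimal sketch of the combination the tip describes: bitsandbytes 4-bit quantization of the transformer plus group offloading, following the `enable_group_offload`/`apply_group_offloading` pattern already shown earlier in memory.md. It assumes a Flux checkpoint and an installed bitsandbytes backend; the model ID, prompt, and parameter values are illustrative and not part of the patch.

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Quantize the transformer (the largest component) to 4-bit NF4 while loading.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)

# Group-offload the quantized transformer; use_stream=True asks diffusers to
# overlap weight transfers with computation, which is where group offloading helps most.
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)

# Offload the other large components too, and keep the small CLIP encoder on the GPU.
apply_group_offloading(pipeline.text_encoder_2, onload_device=onload_device, offload_type="leaf_level")
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
pipeline.text_encoder.to(onload_device)

image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=28).images[0]
image.save("astronaut.png")
```

As the tip notes, at 1024x1024 the transformer's kernels may still outpace the CPU-to-GPU transfers, so a setup like this can end up communication-bound; the overlap pays off more for compute-bound video models.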