
Commit 06b411f

improve docs
1 parent 8f10d05 commit 06b411f

File tree

2 files changed: +7 -7 lines changed


docs/source/en/optimization/memory.md

Lines changed: 2 additions & 2 deletions
@@ -160,9 +160,9 @@ In order to properly offload models after they're called, it is required to run
 
 ## Group offloading
 
-Group offloading is a middle ground between the two above methods. It works by offloading groups of internal layers (either `torch.nn.ModuleList` or `torch.nn.Sequential`). This method is more memory-efficient than model-level offloading. It is also faster than sequential-level offloading, as the number of device synchronizations is reduced.
+Group offloading is a middle ground between the two above methods. It works by offloading groups of internal layers (either `torch.nn.ModuleList` or `torch.nn.Sequential`). This method uses lower memory than model-level offloading. It is also faster than sequential-level offloading, as the number of device synchronizations is reduced.
 
-Another supported feature (for CUDA devices with support for asynchronous data transfer streams) is the ability to overlap data transfer and computation to reduce the overall execution time. This is enabled using layer prefetching with CUDA streams, i.e., the layer that is to be executed next starts onloading to the accelerator device while the current layer is being executed - this increases the memory requirements slightly. Note that this implementation also supports leaf-level offloading but can be made much faster when using streams.
+Another supported feature (for CUDA devices with support for asynchronous data transfer streams) is the ability to overlap data transfer and computation to reduce the overall execution time compared to sequential offloading. This is enabled using layer prefetching with CUDA streams, i.e., the layer that is to be executed next starts onloading to the accelerator device while the current layer is being executed - this increases the memory requirements slightly. Note that this implementation also supports leaf-level offloading but can be made much faster when using streams.
 
 To enable group offloading, either call the [`~ModelMixin.enable_group_offloading`] method on the model or use [`~hooks.group_offloading.apply_group_offloading`]:

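To make the prefetching described above concrete, here is a minimal usage sketch of the `apply_group_offloading` entry point referenced in the docs. The model checkpoint and the parameter names (`onload_device`, `offload_device`, `offload_type`, `num_blocks_per_group`) are illustrative assumptions and are not part of this diff.

```python
# Sketch: block-level group offloading (assumed parameter names, not shown in this diff).
import torch
from diffusers import CogVideoXTransformer3DModel
from diffusers.hooks import apply_group_offloading

# Load only the transformer; the checkpoint here is an illustrative choice.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Keep weights on CPU and onload groups of two transformer blocks to the GPU
# just before they are executed.
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```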
src/diffusers/hooks/group_offloading.py

Lines changed: 5 additions & 5 deletions
@@ -285,14 +285,14 @@ def apply_group_offloading(
     memory, but can be slower due to the excessive number of device synchronizations.
 
     Group offloading is a middle ground between the two methods. It works by offloading groups of internal layers,
-    (either `torch.nn.ModuleList` or `torch.nn.Sequential`). This method is more memory-efficient than module-level
+    (either `torch.nn.ModuleList` or `torch.nn.Sequential`). This method uses lower memory than module-level
     offloading. It is also faster than leaf-level offloading, as the number of device synchronizations is reduced.
 
     Another supported feature (for CUDA devices with support for asynchronous data transfer streams) is the ability to
-    overlap data transfer and computation to reduce the overall execution time. This is enabled using layer prefetching
-    with streams, i.e., the layer that is to be executed next starts onloading to the accelerator device while the
-    current layer is being executed - this increases the memory requirements slightly. Note that this implementation
-    also supports leaf-level offloading but can be made much faster when using streams.
+    overlap data transfer and computation to reduce the overall execution time compared to sequential offloading. This
+    is enabled using layer prefetching with streams, i.e., the layer that is to be executed next starts onloading to
+    the accelerator device while the current layer is being executed - this increases the memory requirements slightly.
+    Note that this implementation also supports leaf-level offloading but can be made much faster when using streams.
 
     Args:
         module (`torch.nn.Module`):

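The docstring's note about leaf-level offloading with streams could look roughly like this in use. The `offload_type="leaf_level"` and `use_stream` arguments are assumed names not visible in this excerpt, so treat this as a sketch rather than the function's confirmed signature.

```python
# Sketch: leaf-level offloading with stream-based prefetching (assumed arguments).
import torch
from diffusers.hooks import apply_group_offloading

# A tiny stand-in module; any torch.nn.Module with ModuleList/Sequential children works.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)

apply_group_offloading(
    model,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",  # offload each leaf module (Linear, ReLU, ...) individually
    use_stream=True,            # prefetch the next leaf on a CUDA stream while the current one runs
)
```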