From efef58c2be6eb5d3f72831d6a8f4002d95958a9c Mon Sep 17 00:00:00 2001 From: stevhliu Date: Wed, 28 May 2025 13:21:22 -0700 Subject: [PATCH 1/2] cache --- docs/source/en/_toctree.yml | 2 + docs/source/en/api/cache.md | 60 ++----------------------- docs/source/en/optimization/cache.md | 66 ++++++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 56 deletions(-) create mode 100644 docs/source/en/optimization/cache.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index e9cea85ffc0b..0d6d3aee5a6a 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -178,6 +178,8 @@ - sections: - local: optimization/fp16 title: Accelerate inference + - local: optimization/cache + title: Caching - local: optimization/memory title: Reduce memory usage - local: optimization/xformers diff --git a/docs/source/en/api/cache.md b/docs/source/en/api/cache.md index a6aa5445a845..f156d6c977f3 100644 --- a/docs/source/en/api/cache.md +++ b/docs/source/en/api/cache.md @@ -11,71 +11,19 @@ specific language governing permissions and limitations under the License. --> # Caching methods -## Pyramid Attention Broadcast +Cache methods speedup diffusion transformers by storing and reusing attention states instead of recalculating them. -[Pyramid Attention Broadcast](https://huggingface.co/papers/2408.12588) from Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You. - -Pyramid Attention Broadcast (PAB) is a method that speeds up inference in diffusion models by systematically skipping attention computations between successive inference steps and reusing cached attention states. The attention states are not very different between successive inference steps. The most prominent difference is in the spatial attention blocks, not as much in the temporal attention blocks, and finally the least in the cross attention blocks. Therefore, many cross attention computation blocks can be skipped, followed by the temporal and spatial attention blocks. By combining other techniques like sequence parallelism and classifier-free guidance parallelism, PAB achieves near real-time video generation. - -Enable PAB with [`~PyramidAttentionBroadcastConfig`] on any pipeline. For some benchmarks, refer to [this](https://github.com/huggingface/diffusers/pull/9562) pull request. - -```python -import torch -from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig - -pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16) -pipe.to("cuda") - -# Increasing the value of `spatial_attention_timestep_skip_range[0]` or decreasing the value of -# `spatial_attention_timestep_skip_range[1]` will decrease the interval in which pyramid attention -# broadcast is active, leader to slower inference speeds. However, large intervals can lead to -# poorer quality of generated videos. -config = PyramidAttentionBroadcastConfig( - spatial_attention_block_skip_range=2, - spatial_attention_timestep_skip_range=(100, 800), - current_timestep_callback=lambda: pipe.current_timestep, -) -pipe.transformer.enable_cache(config) -``` - -## Faster Cache - -[FasterCache](https://huggingface.co/papers/2410.19355) from Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, Kwan-Yee K. Wong. 
- -FasterCache is a method that speeds up inference in diffusion transformers by: -- Reusing attention states between successive inference steps, due to high similarity between them -- Skipping unconditional branch prediction used in classifier-free guidance by revealing redundancies between unconditional and conditional branch outputs for the same timestep, and therefore approximating the unconditional branch output using the conditional branch output - -```python -import torch -from diffusers import CogVideoXPipeline, FasterCacheConfig - -pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16) -pipe.to("cuda") - -config = FasterCacheConfig( - spatial_attention_block_skip_range=2, - spatial_attention_timestep_skip_range=(-1, 681), - current_timestep_callback=lambda: pipe.current_timestep, - attention_weight_callback=lambda _: 0.3, - unconditional_batch_skip_range=5, - unconditional_batch_timestep_skip_range=(-1, 781), - tensor_format="BFCHW", -) -pipe.transformer.enable_cache(config) -``` - -### CacheMixin +## CacheMixin [[autodoc]] CacheMixin -### PyramidAttentionBroadcastConfig +## PyramidAttentionBroadcastConfig [[autodoc]] PyramidAttentionBroadcastConfig [[autodoc]] apply_pyramid_attention_broadcast -### FasterCacheConfig +## FasterCacheConfig [[autodoc]] FasterCacheConfig diff --git a/docs/source/en/optimization/cache.md b/docs/source/en/optimization/cache.md new file mode 100644 index 000000000000..a96d3ccaff37 --- /dev/null +++ b/docs/source/en/optimization/cache.md @@ -0,0 +1,66 @@ + + +# Caching + +Caching accelerates inference by storing and reusing redundant attention outputs instead of performing extra computation. It significantly improves efficiency and doesn't require additional training. + +This guide shows you how to use the caching methods supported in Diffusers. + +## Pyramid Attention Broadcast + +[Pyramid Attention Broadcast (PAB)](https://huggingface.co/papers/2408.12588) is based on the observation that many of the attention output differences are redundant. The attention differences are smallest in the cross attention block so the cached attention states are broadcasted and reused over a longer range. This is followed by temporal attention and finally spatial attention. + +PAB can be combined with other techniques like sequence parallelism and classifier-free guidance parallelism for near real-time video generation. + +Set up and pass a [`PyramidAttentionBroadcastConfig`] to a pipeline's transformer to enable it. The `spatial_attention_block_skip_range` controls how often to skip attention calculations in the spatial attention blocks and the `spatial_attention_timestep_skip_range` is the range of timesteps to skip. Take care to choose an appropriate range because a smaller interval can lead to slower inference speeds and a larger interval can result in lower generation quality. 
+
+```python
+import torch
+from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig
+
+pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
+pipeline.to("cuda")
+
+config = PyramidAttentionBroadcastConfig(
+    spatial_attention_block_skip_range=2,
+    spatial_attention_timestep_skip_range=(100, 800),
+    current_timestep_callback=lambda: pipeline.current_timestep,
+)
+pipeline.transformer.enable_cache(config)
+```
+
+## FasterCache
+
+[FasterCache](https://huggingface.co/papers/2410.19355) computes and caches attention features at every other timestep instead of directly reusing cached features because it can cause flickering or blurry details in the generated video. The features from the skipped step are calculated from the difference between the adjacent cached features.
+
+FasterCache also uses a classifier-free guidance (CFG) cache which computes both the conditional and unconditional outputs once. For future timesteps, only the conditional output is calculated and the unconditional output is estimated from the cached biases.
+
+Set up and pass a [`FasterCacheConfig`] to a pipeline's transformer to enable it.
+
+```python
+import torch
+from diffusers import CogVideoXPipeline, FasterCacheConfig
+
+pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
+pipeline.to("cuda")
+
+config = FasterCacheConfig(
+    spatial_attention_block_skip_range=2,
+    spatial_attention_timestep_skip_range=(-1, 681),
+    current_timestep_callback=lambda: pipeline.current_timestep,
+    attention_weight_callback=lambda _: 0.3,
+    unconditional_batch_skip_range=5,
+    unconditional_batch_timestep_skip_range=(-1, 781),
+    tensor_format="BFCHW",
+)
+pipeline.transformer.enable_cache(config)
+``` \ No newline at end of file From 7db872e7ee70c88ff26dc8165186d5ce7a372d48 Mon Sep 17 00:00:00 2001 From: stevhliu Date: Mon, 2 Jun 2025 09:59:43 -0700 Subject: [PATCH 2/2] feedback --- docs/source/en/api/cache.md | 2 +- docs/source/en/optimization/cache.md | 13 ++++++++----- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/docs/source/en/api/cache.md b/docs/source/en/api/cache.md index f156d6c977f3..f5510867310c 100644 --- a/docs/source/en/api/cache.md +++ b/docs/source/en/api/cache.md @@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. --> # Caching methods -Cache methods speedup diffusion transformers by storing and reusing attention states instead of recalculating them. +Cache methods speed up diffusion transformers by storing and reusing intermediate outputs of specific layers, such as attention and feedforward layers, instead of recalculating them at each inference step. ## CacheMixin diff --git a/docs/source/en/optimization/cache.md b/docs/source/en/optimization/cache.md index a96d3ccaff37..ea510aed66a9 100644 --- a/docs/source/en/optimization/cache.md +++ b/docs/source/en/optimization/cache.md @@ -11,15 +11,18 @@ specific language governing permissions and limitations under the License. --> # Caching -Caching accelerates inference by storing and reusing redundant attention outputs instead of performing extra computation. It significantly improves efficiency and doesn't require additional training. +Caching accelerates inference by storing and reusing intermediate outputs of different layers, such as attention and feedforward layers, instead of performing the entire computation at each inference step. 
It significantly improves generation speed at the expense of more memory and doesn't require additional training. This guide shows you how to use the caching methods supported in Diffusers. ## Pyramid Attention Broadcast -[Pyramid Attention Broadcast (PAB)](https://huggingface.co/papers/2408.12588) is based on the observation that many of the attention output differences are redundant. The attention differences are smallest in the cross attention block so the cached attention states are broadcasted and reused over a longer range. This is followed by temporal attention and finally spatial attention. +[Pyramid Attention Broadcast (PAB)](https://huggingface.co/papers/2408.12588) is based on the observation that attention outputs aren't that different between successive timesteps of the generation process. The differences are smallest in the cross attention layers, so their cached attention outputs are reused over the longest timestep range, followed by the temporal attention and spatial attention layers.
+
+> [!TIP]
+> Not all video models have three types of attention (cross, temporal, and spatial)!
+
-PAB can be combined with other techniques like sequence parallelism and classifier-free guidance parallelism for near real-time video generation. +PAB can be combined with other techniques like sequence parallelism and classifier-free guidance parallelism (data parallelism) for near real-time video generation.

 Set up and pass a [`PyramidAttentionBroadcastConfig`] to a pipeline's transformer to enable it. The `spatial_attention_block_skip_range` controls how often to skip attention calculations in the spatial attention blocks and the `spatial_attention_timestep_skip_range` is the range of timesteps to skip. Take care to choose an appropriate range because a smaller interval can lead to slower inference speeds and a larger interval can result in lower generation quality. 

@@ -40,9 +43,9 @@ pipeline.transformer.enable_cache(config) ## FasterCache -[FasterCache](https://huggingface.co/papers/2410.19355) computes and caches attention features at every other timestep instead of directly reusing cached features because it can cause flickering or blurry details in the generated video. The features from the skipped step are calculated from the difference between the adjacent cached features. +[FasterCache](https://huggingface.co/papers/2410.19355) caches and reuses attention features similarly to [PAB](#pyramid-attention-broadcast), since the output differences between successive timesteps are small.

-FasterCache also uses a classifier-free guidance (CFG) cache which computes both the conditional and unconditional outputs once. For future timesteps, only the conditional output is calculated and the unconditional output is estimated from the cached biases. +This method can also skip the unconditional branch prediction when classifier-free guidance is used for sampling (common in most base models), and estimate it from the conditional branch prediction if there is significant redundancy in the predicted latent outputs between successive timesteps.

 Set up and pass a [`FasterCacheConfig`] to a pipeline's transformer to enable it.