
Commit 6b58810

feedback
1 parent 2e1ed24 · commit 6b58810

File tree: 1 file changed (+35, -129 lines)


docs/source/en/optimization/attention_backends.md

Lines changed: 35 additions & 129 deletions
@@ -22,17 +22,16 @@ Available attention implementations include the following.
 | PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
 | xFormers | memory-efficient attention with support for various attention kernels |

-This guide will show you how to use the dispatcher to set and use the different attention backends.
+This guide will show you how to set and use the different attention backends.

-## FlashAttention
+## set_attention_backend

-[FlashAttention](https://github.com/Dao-AILab/flash-attention) reduces memory traffic by making better use of on-chip shared memory (SRAM) instead of global GPU memory so the data doesn't have to travel far. The latest variant, FlashAttention-3, is further optimized for modern GPUs (Hopper/Blackwell) and also overlaps computations and handles FP8 attention better.
+The [`~ModelMixin.set_attention_backend`] method iterates through all the modules in the model and sets the appropriate attention backend to use. The attention backend setting persists until [`~ModelMixin.reset_attention_backend`] is called.

-There are several available FlashAttention variants, including variable length and the original FlashAttention. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L163).
+The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernel](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.

-The example below demonstrates how to enable the `_flash_3_hub` implementation. The [kernel](https://github.com/huggingface/kernels) library allows you to instantly use optimized compute kernels from the Hub without requiring any setup.
-
-Pass the attention backend to the [`~ModelMixin.set_attention_backend`] method.
+> [!TIP]
+> FlashAttention-3 is not supported for non-Hopper architectures, in which case, use FlashAttention (set_attention_backend("flash")).

 ```py
 import torch
@@ -44,129 +43,15 @@ pipeline = QwenImagePipeline.from_pretrained(
 pipeline.transformer.set_attention_backend("_flash_3_hub")
 ```

-You could also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-prompt = """
-cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
-highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
-"""
-
-with attention_backend("_flash_3_hub"):
-    image = pipeline(prompt).images[0]
-```
-
-To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].
-
-```py
-pipeline.transformer.reset_attention_backend()
-```
-
-## SageAttention
-
-[SageAttention](https://github.com/thu-ml/SageAttention) quantizes attention by computing queries (Q) and keys (K) in INT8. The probability (P) and value (V) are calculated in either FP8 or FP16 to minimize error. This significantly increases inference throughput and with little to no degradation.
-
-There are several SageAttention variants for FP8 and FP16 as well as whether it is CUDA or Triton based. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L182).
-
-The example below uses the `_sage_qk_int8_pv_fp8_cuda` implementation.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp8_cuda")
-```
-
-You could also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-prompt = """
-cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
-highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
-"""
-
-with attention_backend("_sage_qk_int8_pv_fp8_cuda"):
-    image = pipeline(prompt).images[0]
-```
-
-To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].
-
-```py
-pipeline.transformer.reset_attention_backend()
-```
-
-## PyTorch native
-
-PyTorch includes a [native implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) of several optimized attention implementations including [FlexAttention](https://pytorch.org/blog/flexattention/), FlashAttention, memory-efficient attention, and a C++ version.
-
-For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L171).
-
-The example below uses the `_native_flash` implementation.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-pipeline.transformer.set_attention_backend("_native_flash")
-```
-
-You could also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-prompt = """
-cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
-highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
-"""
-
-with attention_backend("_native_flash"):
-    image = pipeline(prompt).images[0]
-```
-
 To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

 ```py
 pipeline.transformer.reset_attention_backend()
 ```

-## xFormers
-
-[xFormers](https://github.com/facebookresearch/xformers) provides memory-efficient attention algorithms such as sparse attention and block-sparse attention. Pass `xformers` to enable it.
-
-```py
-import torch
-from diffusers import QwenImagePipeline
-
-pipeline = QwenImagePipeline.from_pretrained(
-    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-pipeline.transformer.set_attention_backend("xformers")
-```
+## attention_backend context manager

-You could also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.
+The [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager temporarily sets an attention backend for a model within the context. Outside the context, the default attention (PyTorch's native scaled dot product attention) is used. This is useful if you want to use different backends for different parts of a pipeline or if you want to test the different backends.

 ```py
 import torch
@@ -180,12 +65,33 @@ cinematic film still of a cat sipping a margarita in a pool in Palm Springs, Cal
 highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
 """

-with attention_backend("xformers"):
+with attention_backend("_flash_3_hub"):
     image = pipeline(prompt).images[0]
 ```

-To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].
-
-```py
-pipeline.transformer.reset_attention_backend()
-```
+## Available backends
+
+Refer to the table below for available attention backends.
+
+| Backend Name | Family | Description |
+|--------------|--------|-------------|
+| `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
+| `flex` | [FlexAttention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention) | PyTorch FlexAttention implementation |
+| `_native_cudnn` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | CuDNN-optimized attention |
+| `_native_efficient` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Memory-efficient attention |
+| `_native_flash` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | PyTorch's FlashAttention |
+| `_native_math` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Math-based attention (fallback) |
+| `_native_npu` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | NPU-optimized attention |
+| `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
+| `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
+| `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
+| `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
+| `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
+| `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
+| `sage` | [SageAttention](https://github.com/thu-ml/SageAttention) | Quantized attention (INT8 QK) |
+| `sage_varlen` | [SageAttention](https://github.com/thu-ml/SageAttention) | Variable length SageAttention |
+| `_sage_qk_int8_pv_fp8_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (CUDA) |
+| `_sage_qk_int8_pv_fp8_cuda_sm90` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (SM90) |
+| `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
+| `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
+| `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
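
Taken together, the page this commit updates boils down to a short workflow: pick a backend name from the table, set it on the model, and reset it when done. The sketch below is not part of the diff; it mirrors the examples above using the `flash` backend suggested in the tip for non-Hopper GPUs, and assumes FlashAttention-2 is installed (the shortened prompt is only for brevity).

```py
import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Persistently switch the transformer to FlashAttention-2 ("flash" in the table).
# The setting applies to every following call until it is reset.
pipeline.transformer.set_attention_backend("flash")
image = pipeline("cinematic film still of a cat sipping a margarita in a pool").images[0]

# Restore the default backend (PyTorch's scaled_dot_product_attention).
pipeline.transformer.reset_attention_backend()
```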
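
The context-manager examples in the diff call `attention_backend(...)` without showing its import. A plausible sketch is below, assuming the context manager is imported from the `attention_dispatch.py` module the doc links to; the exact public import path may differ across diffusers versions. The `_native_cudnn` name is taken from the table and needs no extra dependency.

```py
import torch
from diffusers import QwenImagePipeline
# Assumed import location, based on the attention_dispatch.py file linked above;
# the public import path may vary between diffusers versions.
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)

# The backend applies only inside the context; outside it, the default
# PyTorch scaled_dot_product_attention path is used again.
with attention_backend("_native_cudnn"):
    image = pipeline("cinematic film still of a cat sipping a margarita in a pool").images[0]
```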
