Available attention implementations include the following.

| PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention)|
| xFormers | memory-efficient attention with support for various attention kernels |

This guide will show you how to set and use the different attention backends.

## set_attention_backend

The [`~ModelMixin.set_attention_backend`] method iterates through all the modules in the model and sets the appropriate attention backend to use. The attention backend setting persists until [`~ModelMixin.reset_attention_backend`] is called.

### FlashAttention

[FlashAttention](https://github.com/Dao-AILab/flash-attention) reduces memory traffic by making better use of on-chip shared memory (SRAM) instead of global GPU memory, so the data doesn't have to travel as far. The latest variant, FlashAttention-3, is further optimized for modern GPUs (Hopper and Blackwell) and also overlaps computation and handles FP8 attention better.

There are several FlashAttention variants, including variable-length attention and the original FlashAttention. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L163).

The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernels](https://github.com/huggingface/kernels) library, which lets you instantly use optimized compute kernels from the Hub without requiring any setup. Pass the attention backend to the [`~ModelMixin.set_attention_backend`] method.

> [!TIP]
> FlashAttention-3 is not supported on non-Hopper architectures. In that case, use FlashAttention instead with `set_attention_backend("flash")`.

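A minimal sketch of that example follows, assuming the `kernels` package is installed; the Flux pipeline and checkpoint here are illustrative stand-ins for whatever model you are running.

```py
import torch
from diffusers import FluxPipeline  # illustrative pipeline; substitute your own model

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Fetch the FlashAttention-3 kernel from the Hub and use it in every attention module.
pipeline.transformer.set_attention_backend("_flash_3_hub")

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
image = pipeline(prompt).images[0]
```
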
To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

```py
pipeline.transformer.reset_attention_backend()
```

### SageAttention

[SageAttention](https://github.com/thu-ml/SageAttention) quantizes attention by computing the queries (Q) and keys (K) in INT8. The probability (P) and value (V) are calculated in either FP8 or FP16 to minimize error. This significantly increases inference throughput with little to no quality degradation.

There are several SageAttention variants for FP8 and FP16, implemented in either CUDA or Triton. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L182).

The example below uses the `_sage_qk_int8_pv_fp8_cuda` implementation.

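A minimal sketch, assuming the SageAttention package is installed and that `pipeline` and `prompt` are defined as in the example above; the other Sage variants in the linked list are selected the same way.

```py
# INT8 queries/keys with FP8 probability-value accumulation on CUDA;
# requires the sageattention package.
pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp8_cuda")
image = pipeline(prompt).images[0]
```
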
### PyTorch native

PyTorch ships a [native implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) of several optimized attention algorithms, including [FlexAttention](https://pytorch.org/blog/flexattention/), FlashAttention, memory-efficient attention, and a C++ version.

For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L171).

The example below uses the `_native_flash` implementation.

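A sketch, again reusing `pipeline` and `prompt` from the first example; no extra packages are needed since this routes attention through PyTorch's built-in kernels.

```py
# Use torch.nn.functional.scaled_dot_product_attention restricted to its
# FlashAttention backend.
pipeline.transformer.set_attention_backend("_native_flash")
image = pipeline(prompt).images[0]
```
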
### xFormers

[xFormers](https://github.com/facebookresearch/xformers) provides memory-efficient attention algorithms such as sparse attention and block-sparse attention. Pass `xformers` to [`~ModelMixin.set_attention_backend`] to enable it.

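A one-line sketch, assuming xFormers is installed (`pip install xformers`) and `pipeline` is defined as before.

```py
# Requires the xformers package.
pipeline.transformer.set_attention_backend("xformers")
```
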
## attention_backend context manager

The [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager temporarily sets an attention backend for a model within the context. Outside the context, the default attention backend (PyTorch's native scaled dot product attention) is used. This is useful if you want to use different backends for different parts of a pipeline or if you want to test the different backends.

```py
import torch
from diffusers import FluxPipeline  # illustrative pipeline; substitute your own model
# the attention_backend context manager lives in diffusers.models.attention_dispatch (see the link above)
from diffusers.models.attention_dispatch import attention_backend

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

with attention_backend("_flash_3_hub"):
    image = pipeline(prompt).images[0]
```

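For example, a sketch like the following (reusing `pipeline`, `prompt`, and `attention_backend` from the block above) runs the same prompt under two backends so they can be compared; both backend names come from the dispatcher list linked earlier.

```py
# Each iteration runs with a different backend; the default native backend is
# restored automatically outside each `with` block. The Sage backend
# additionally requires the sageattention package.
images = {}
for backend in ("_native_flash", "_sage_qk_int8_pv_fp8_cuda"):
    with attention_backend(backend):
        images[backend] = pipeline(prompt).images[0]
```
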
## Available backends

Refer to the table below for available attention backends.

| Backend Name | Family | Description |
|--------------|--------|-------------|
|`native`|[PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend)| Default backend using PyTorch's scaled_dot_product_attention |