[docs] slight edits to the attention backends docs. #12394
Changes from 1 commit
````diff
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

 # Attention backends

-> [!TIP]
+> [!NOTE]
 > The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems.

 Diffusers provides several optimized attention algorithms that are more memory and computationally efficient through its *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.
````
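For orientation while reviewing, here is a minimal sketch of the API this page documents: switching every attention module in a model to a different backend with [`~ModelMixin.set_attention_backend`]. The pipeline class, model id, and prompt are illustrative assumptions, not taken from the PR, and the `flash` backend assumes the flash-attn package is installed.

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Route every attention module in the transformer through FlashAttention
# (assumes the flash-attn package is installed).
pipeline.transformer.set_attention_backend("flash")

image = pipeline("a photo of a cat holding a sign that says hello").images[0]
```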
````diff
@@ -33,7 +33,7 @@ The [`~ModelMixin.set_attention_backend`] method iterates through all the module

 The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernels](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.

-> [!TIP]
+> [!NOTE]
 > FlashAttention-3 is not supported on non-Hopper architectures; in that case, use FlashAttention with `set_attention_backend("flash")`.

 ```py
````
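The body of that example falls in the part of the file not shown in this diff. As a hedged sketch (not the file's actual code), temporarily switching to the Hub's FlashAttention-3 kernel with the `attention_backend` context manager that appears in the next hunk could look like the following; the import path for the context manager is an assumption, as are the pipeline, model id, and prompt.

```py
import torch
from diffusers import FluxPipeline
# NOTE: import path assumed for illustration; the context manager may be
# exported from a different module depending on the diffusers version.
from diffusers.models.attention_dispatch import attention_backend

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a photo of a cat holding a sign that says hello"

# Only calls made inside the context use the FlashAttention-3 kernel from the
# Hub; outside it, the model falls back to its previously configured backend.
with attention_backend("_flash_3_hub"):
    image = pipeline(prompt).images[0]
```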
````diff
@@ -78,10 +78,16 @@ with attention_backend("_flash_3_hub"):
     image = pipeline(prompt).images[0]
 ```

+> [!TIP]
+> Most of these attention backends are compatible with `torch.compile` without any graph breaks. Consider using it for maximum speedups.
+
 ## Available backends

 Refer to the table below for a complete list of available attention backends and their variants.

+<details>
+<summary>Expand</summary>
+
 | Backend Name | Family | Description |
 |--------------|--------|-------------|
 | `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
````

**Review comments** (on the `<details>` addition):

> May be simpler to replace the table at the beginning of the docs with this more complete one, wdyt?

> Actually, the first table is nice in the sense that it provides a consolidated overview of what's supported. So, I'd prefer to keep it.
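As a rough illustration of the `torch.compile` tip added in the hunk above (the PR itself adds no code for it; pipeline, model id, and prompt are assumptions), compiling the transformer after setting a backend could look like:

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

pipeline.transformer.set_attention_backend("flash")

# Per the tip, the dispatched backends are expected to compile without
# graph breaks, so fullgraph=True is a reasonable setting to try.
pipeline.transformer = torch.compile(pipeline.transformer, fullgraph=True)

image = pipeline("a photo of a cat holding a sign that says hello").images[0]
```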
````diff
@@ -104,3 +110,5 @@ Refer to the table below for a complete list of available attention backends and
 | `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
 | `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
 | `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
+
+</details>
````